\documentclass[12pt]{article}
\usepackage{layout, latexsym, array, enumerate, amsmath, amsthm,amssymb,
amsfonts}
\usepackage[mathscr]{eucal}
\usepackage{epsf,epsfig}
\textwidth 6.5in \textheight 9.00in \oddsidemargin -0.15in
\evensidemargin -0.15in \topmargin -0.25in
\newtheorem{theorem}{Theorem}[section]
\newtheorem{corollary}{Corollary}[theorem]
\newtheorem{example}{Example}[section]
\newtheorem{lemma}{Lemma}[section]
\newtheorem{defn}{Definition}[section]
\newcommand{\lowtilde}[1]{\mathop{#1}\limits_{\textstyle\tilde{}}}
%\renewcommand{\baselinestretch}{1.4}


\begin{document}
\title{Local Likelihood and the EMS Algorithm}
\author{Chun-Po Steve Fan\footnote{Chun-Po Steve Fan is a doctoral student at the School of Public Health, University of Toronto} ~ and Jamie Stafford\\University of Toronto}
\thispagestyle{empty}
\date{}
\maketitle
%\renewcommand{\baselinestretch}{1.4}
\begin{abstract}
The use of local likelihood methods (Hastie and Tibshirani 1988,
Loader 1999) in the presence of data that are either interval/area
censored, or have been aggregated into bins, leads naturally to the
consideration of EM-type strategies, or rather local-EM algorithms.
We begin by exploring a class of local-EM algorithms for density
estimation (Braun, Duchesne and Stafford 2005) where one member of
this class retains the simplicity and interpretive appeal of the
usual kernel density estimate. We demonstrate that using a
particular conditional distribution at the E-step results in the
algorithm collapsing explicitly into an EMS algorithm of the type
considered by Silverman, Jones, Nychka and Wilson (1990). This is
true for the entire local likelihood class and may be extended to
other classes. In particular we consider the local likelihood class
for intensity estimation and generalize the self-consistency
algorithm of Hu, Lagakos and Lockhart (2008) for panel count data.

The advantages of identifying a relationship between local
likelihood and the EMS algorithm are that the former provides a
natural context for the latter, which is often referred to as ad hoc
in the literature, while the latter provides a set of tools to guide
the use, and implementation, of local-EM algorithms. For example, we
expose a previously unknown connection between local-EM algorithms
and penalized likelihood that is analogous to the more familiar
pairing of EM and likelihood.
\end{abstract}

\vspace{10pt} \small \it Keywords: density estimation; intensity
estimation; interval/area censoring; local-EM; panel counts; penalized
likelihood; self-consistency \rm \normalsize

\section{Introduction}
In this paper we consider extending the methods of Braun, Duchesne
and Stafford (2005) for density estimation to situations where an
intensity function is the object of interest. Developments eventually
lead to a generalization of the self-consistency algorithm of Hu, Lagakos and Lockhart (2008). Here the data
may be interval censored, they may be temporal and come in the form of panel counts, or they may be spatial and area censored. Whatever form the censoring takes, the use of local likelihood techniques naturally leads to the consideration of local-EM algorithms.

In situations where data are interval censored, Braun, Duchesne and
Stafford (2005) proposed a naive kernel density estimator, embedded
this estimator in a local likelihood class where it was recognized as a local-EM
algorithm, and demonstrated that it was a generalization of the
self-consistency algorithm of Li \emph{et al.}\ (1997). However,
despite these interesting developments, the convergence of the local-EM
algorithm was difficult to demonstrate, and it was not clear whether
its fixed point maximized any particular criterion.

In this paper, we consider simplifying the E-step of the local-EM
algorithm in Braun \emph{et al.}\ (2005) by approximating conditional expectations using a
piecewise constant function. This results in the local-EM
algorithm collapsing explicitly into an EMS algorithm where a
smoothing step is added to the expectation and maximization steps of
the usual EM algorithm (Silverman, Jones, Nychka and Wilson, 1990).
This has two advantages. First, this simplification embeds the EMS algorithm in the
local likelihood context in which it is seen to arise naturally from
the consideration of EM-type strategies. Silverman \emph{et al.}\ (1990) had originally referred to the EMS algorithm as
\emph{ad hoc} but its relationship to local likelihood suggests
otherwise\footnote[1]{Nychka (1990) demonstrates that a modified EMS
algorithm is related to penalized likelihood. As a result, he also
suggests that the EMS algorithm is not \emph{ad hoc}.}. Secondly, the EMS
algorithm has been extensively studied, and much is known about its
convergence (Latham, 1994, 1995, 1996) and its relationship to
penalized likelihood (Nychka, 1990). The latter suggests a
previously unknown connection between local-EM and penalized likelihood, analogous to the more familiar pairing of EM
and likelihood.

For the sake of clarity we initially separate the
treatment of density and intensity estimation although \S 2
provides the notation used for both cases. In \S 3, we return to the
setting of Braun, Duchesne and Stafford (2005) and demonstrate that
simplifying the E-step in their local EM algorithm results in an
EMS or adaptive EMS algorithm. Here the relationship with the EM
algorithm of Turnbull (1976) becomes explicit.  Furthermore, we explore
convergence properties using the results of Latham (1994, 1995, 1996). The
treatment of intensity estimation in \S 4 resembles the details of
\S 3 due to similarities in local likelihood functions; however, the context
is much broader. Here we focus on the temporal context, in which data
come in the form of panel counts, but the algorithms derived are virtually
identical to what would be used in the spatial context in which data
may be area censored. A full treatment of the spatial context in an
epidemiological setting may be found in Brown, Fan and Stafford
(2008). A brief simulation study is given in \S 5.

In \S 6, we present results meant to strengthen the suggestion that,
for the context considered in this paper, local-EM and penalized
likelihood may be paired in a manner analogous to the pairing of EM
and likelihood.  We first present a version of the local-EM algorithm, equivalent to Nychka's (1990) modified EMS, that maximizes a penalized likelihood function.  Next, we show that EMS iterates converge uniformly to their local-EM counterparts.  These two results suggest that local-EM and EMS techniques may be thought of synonymously. Finally, we investigate the penalty function induced by the kernel and give concluding remarks in \S 7.

% In \S 6.1 we demonstrate that the use of an equivalent
% kernel in a local-EM algorithm leads to the modification necessary
% to maximize of a penalized likelihood function (Nychka, 1990). In \S 6.2 we prove
% the uniform
% convergence of the EMS iterate to its local-EM counterpart. This,
% and the developments of \S 3 \& 4, suggest local-EM and EMS
% techniques may be thought of synonymously. Finally, in \S 6.3 we study
% the penalty of \S 6.1
% under the conditions of \S 6.2. Final remarks are given in \S 7.

\section{Some initial details}

In the context of this paper we assume a study consisting of $n$
independent subjects where the time of an event, or a recurring
event, is under investigation. Each subject is observed at a set of times
${\cal T}_i=\{\tau_{i1},\ldots,\tau_{ik_i}\}$ determined by a visit
process assumed to be independent of the event time process. The
event time process is assumed to be either a failure time process or
a nonhomogeneous Poisson process. In either case, the counting
process $N_i(t)$ gives the number of events up to time $t$ for the
$i$th subject. Interest lies in the expected value
$$
\Lambda(t)=E[N_i(t)],
$$
which is assumed to be common to each subject in the study. When the
event time process is a failure time process, $N_i(t)$ is an
indicator process for the failure time $T_i$, and $\Lambda(t)$ is
the cumulative distribution function for $T_i$ with density $f$.
Otherwise $\Lambda(t)$ is a cumulative intensity function for
Poisson process data with intensity $\lambda$.

In the case of a failure time process, the event time $T_i$ is
assumed to either fall between two adjacent elements of ${\cal T}_i$
or to be right censored. In either case, $T_i$ is interval censored, and we denote the relevant interval as $I_i=[L_i,R_i]$ where
$L_i,R_i\in{\cal T}_i\cup \{\infty\}$, and of course $T_i\in I_i$.
Note that if the event time for the $i$th subject is right
censored, we set $R_i=\infty$. The observed data are then a sequence
of independent intervals $I_1,\ldots,I_n$, some of which may overlap.
We let ${\cal J}=\{J_1,\ldots,J_k\}$ denote the partition of the
data defined by the collection of endpoints
$\{L_i,R_i;i=1,\ldots,n\}$ and $|\!|J_j|\!|$ denotes the length of $J_j$. For example, if $n=2$ and
$I_1=[0,3],~I_2=[1,2]$ then we would have ${\cal
J}=\{[0,1],[1,2],[2,3]\}$. We denote the indicator function of the
set $I_i\cap J_j$ by ${\cal I}_{ij},$ and for the purposes of density
estimation we define $g$ as a piecewise constant function in which
$g(t)=p_j/||J_j||$ for $t\in J_j$ with $p_j=\int_{J_j} f(u)\, \mathrm{d}u$.
This piecewise constant function $g(t)$ may be formally
referred to as the ${\cal J}$-approximant of $f$ (Royden, 1988).
Finally, we denote the collection of all $p_j$'s by the vector $\bf
p$.
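
For readers who wish to experiment, the construction of the partition ${\cal J}$ and the indicators ${\cal I}_{ij}$ is easily automated. The following Python fragment is a minimal illustrative sketch of the setup above; it ignores right censoring (infinite endpoints), and the function name is ours rather than that of any established package.
\begin{verbatim}
import numpy as np

def make_partition(intervals):
    """Build the partition J from the interval endpoints and the
    indicator matrix I[i, j] = 1 if cell J_j lies inside interval I_i."""
    ends = np.unique(np.concatenate(intervals))      # sorted endpoints
    J = list(zip(ends[:-1], ends[1:]))               # cells J_1,...,J_k
    I = np.array([[1.0 if (l <= a and b <= r) else 0.0
                   for (a, b) in J]
                  for (l, r) in intervals])
    return J, I

# Example: I_1 = [0, 3] and I_2 = [1, 2] give J = {[0,1], [1,2], [2,3]}
J, I = make_partition([np.array([0.0, 3.0]), np.array([1.0, 2.0])])
\end{verbatim}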

In the case of a Poisson process, event times are still interval
censored, but we may have multiple events in each interval. For
subject $i$, we denote the times of these events as $T_{ij}$, where
$j=1,\ldots, n_i$, with $n_i$ the total number of events observed
for the $i$th subject. In this setting,
$I_{ij}=[\tau_{ij-1},\tau_{ij}]$ is referred to as the $j$th
panel for the $i$th individual, and we denote the number of events
in the interval $I_{ij}$ by
$n_{ij}=N_i(\tau_{ij})-N_i(\tau_{ij-1})$. Following the setup of Hu,
Lagakos and Lockhart (2008), we let
$${\cal T}=\cup_{i=1}^n {\cal T}_i=\{\tau_0,\ldots,\tau_k\}$$ and again
let ${\cal J}=\{J_1,\ldots,J_k\}$ denote a partition of the data,
in which $J_j=[\tau_{j-1},\tau_j]$ and $||J_j|| = \tau_j - \tau_{j-1}$. We denote the indicator
function of the set $I_{ij}\cap J_l$ by ${\cal I}_{ijl}$.  For the
purposes of intensity estimation, we now define $g$ as a piecewise
constant function in which $g(t)=\Lambda_j/||J_j||$ for $t\in J_j$, where
$\Lambda_j=\int_{J_j} \lambda(u)\, \mathrm{d}u$. This function $g(t)$ may
now be formally referred to as the ${\cal J}$-approximant of
$\lambda$. We denote the collection of all $\Lambda_j$'s by the
vector $\bf \Lambda$. Finally, $\tau_{ik_i}$'s are allowed to vary among subjects,
indicating the potential for subjects to have different dropout
times. To accommodate this, we define $Y_i(t)$ as an indicator of
dropout for the $i$th subject and set $Y(t)=\sum_i Y_i(t)$. We
assume a monotone dropout process in which dropout can only occur at a time $T\in {\cal
T}$.

\section{Local likelihood Density Estimation}
Local likelihood techniques for density estimation were pioneered by
Loader (1996) and Hjort and Jones (1996) in the context of
completely observed data. Both propose estimating the unknown
density locally at $t$ by maximizing the criterion
\begin{eqnarray}\label{llde}
{\cal L}(f,t)=\sum_{i=1}^n K_h(T_i-t)\log f(T_i)-n \int_\Re
K_h(u-t)f(u)\, \mathrm{d}u
\end{eqnarray}
over some suitable class of functions. Here $K_h(z)= K(z/h)/h$ and $K$ denotes a kernel function where typically $\int K(z)
\, \mathrm{d}z =1$. Loader (1996) suggests
approximating the log density near $t$ using polynomials, that is,
\begin{eqnarray*}
\log f(u)={\cal P}(u-t)=\sum_{j=0}^p a_j(u - t)^j.
\end{eqnarray*}
Upon substitution of the polynomial expansion into (\ref{llde}), one
may maximize it with respect to ${\bf
a}=\{{a}_{0},{a}_{1},\ldots,{a}_{p}\}$ and set the density estimate
as $ \hat{f}(t)=\exp(\hat{a}_0)$. In the locally constant case ($p=0$), setting the derivative of (\ref{llde}) with respect to $a_0$ to zero gives $\hat{f}(t)=\exp(\hat{a}_0)=n^{-1}\sum_{i=1}^n K_h(T_i-t)$, the usual kernel density estimate.

When the data are interval censored Braun, Duchesne and Stafford
(2005) propose replacing (\ref{llde}) with
\begin{eqnarray}\label{llicdp}
{\cal L}({\bf a},t)=\sum_{i=1}^n E_f[K_h(T_i-t){\cal P}(T_i-t) \mid I_i]-n
\int_\Re K_h(u-t)\exp\{{\cal P}(u-t)\}\, \mathrm{d}u
\end{eqnarray}
and use a local-EM algorithm which cycles through two steps at
each iteration:
\begin{quote}
{\tt E-step}: compute the relevant expectations using the current
estimate of $f$
\newline
{\tt M-step}: maximize (\ref{llicdp}) to get updated estimates of
$\{a_0, a_1, \ldots, a_p\}$ and $f$.
\end{quote}
The algorithm differs from a typical EM algorithm because, at the
E-step, expectation is computed with respect to an estimate of the
infinite dimensional parameter $f$ while, at the M-step, we only
estimate this parameter locally at $t$. As such the typical
arguments concerning the convergence of the EM algorithm cannot be
brought to bear. Furthermore, if the local-EM algorithm converges to
a fixed point $\hat{f}$, it is not clear what criterion this fixed point
optimizes.

\subsection{Local-EM and the EMS algorithm}
In the locally constant case, the approximating polynomial is
truncated at the leading term $a_0$ and the local-EM algorithm
becomes
\begin{eqnarray} \label{e: local_em}
\hat{f}_{r+1}(t) = n^{-1}
\sum_{i=1}^n\mbox{E}_{\hat{f}_{r}} \left[ K_h\left(T_i-t \right) \mid I_i\right].
\end{eqnarray}
Here expectation is computed with respect to the conditional density
\begin{eqnarray} \label{conden_1}
\hat{f}_{r;I_i}(u)=\dfrac{\hat{f}_{r}(u)}{\int_{I_i}{\hat{f}_{r}(t)}\, \mathrm{d}t}
\text{ for $u\in I_i$}
\end{eqnarray}
and the algorithm is iterated until it converges to a fixed point
$\hat{f}$ that solves
\begin{eqnarray}\label{iter}
\hat{f}(t)=n^{-1} \sum_{i=1}^n\mbox{E}_{\hat{f}}\left[\left.
K_h\left({T_i-t}\right)\right|I_i\right] \quad \mbox{for all  $t \in {\cal S}$, }
\end{eqnarray}
where $\cal S$ denotes the support of the unknown density $f$.

This algorithm may be implemented using a discretization of
$f$. This essentially reduces the infinite dimensional estimation
problem of finding $\hat{f}$ to one that has finite dimension. We
begin by approximating $p_j$ with $\hat{p}_{r, j}=\int_{J_j}
\hat{f}_r(t)\, \mathrm{d}t$ and, upon substitution, $g(t)$ with
$\hat{g}_r(t)=\hat{p}_{rj}/|\!|J_j|\!|$ for $t\in J_j$. Now, rather than
using $\hat{f}_{r}$ to compute conditional expectations, consider
simplifying the iteration (\ref{e: local_em}) by using the piecewise
constant
function $\hat{g}_r$. Here the conditional density (\ref{conden_1}) becomes
\[
\hat{g}_{r;I_i}(t)=\dfrac{\hat{p}_{rj} {\cal I}_{ij}}{||J_j||\sum_m
\hat{p}_{rm} {\cal I}_{im}} \text{ for $t \in J_j$,}
\]
and $\mbox{E}_{\hat{f}_{r}}\left[\left.
K_h\left({T_i-t}\right)\right|I_i\right]$ in
(\ref{iter}) is replaced with
\begin{eqnarray*}
\mbox{E}_{\hat{g}_{r}}\left[\left.
K_h\left({T_i-t}\right)\right|I_i\right]&=&\int_{I_i}K_h\left({u-t}\right)\hat{g}_{r;I_i}(u)\, \mathrm{d}u\\
&=&\sum_j{{\hat{p}_{r, j}{\cal I}_{ij}\int_{J_j}K_h\left({u-t}\right)\,
\mathrm{d}u}\over{||J_j||\sum_l \hat{p}_{r, l} {\cal I}_{il}}}
\end{eqnarray*}
In order to proceed to the next iteration, we are required to compute:
\begin{eqnarray*}
\hat{p}_{r+1, j}&=&\int_{J_j} \hat{f}_{r+1}(t)\, \mathrm{d}t\\
%
&=& n^{-1} \int_{J_j} {\Bigg\{}\sum_{il}{{\hat{p}_{rl} {\cal I}_{il}
\int_{J_l} K_h(u-t) \, \mathrm{d}u}\over{ ||J_l||\sum_m
\hat{p}_{rm}{\cal I}_{im}}}{\Bigg \}}\, \mathrm{d}t\\
%
&=& n^{-1} \sum_{il}{{\hat{p}_{r l} {\cal I}_{il} \int_{J_j}
\int_{J_l} K_h(u-t) \, \mathrm{d}u \,
\mathrm{d}t}\over{ |\!|J_l |\!|\sum_m \hat{p}_{r m} {\cal I}_{im}}},
\end{eqnarray*}
and this leads to a simple iteration on the vector of probabilities
$\hat{\bf p}_{r+1}$. This may be expressed in terms of matrices
as
\begin{eqnarray} \label{EMSdc}
\hat{\bf p}_{r+1}={\cal M}(\hat{\bf p}_{r}) {\cal K}_h,
\end{eqnarray}
where ${\cal K}_h$ is a $k\times k$ smoothing matrix with entries
${\cal K}_{lj}= ||J_l||^{-1} \int_{J_j} \int_{J_l}
K_h\left({u-t}\right) \, \mathrm{d}u \, \mathrm{d}t$, and ${\cal
M}(\hat{\bf p}_{r})$ is a $k$-dimensional row vector with $\ell$th
entry
$$
n^{-1} \sum_{i} {{\hat{p}_{r\ell} {\cal I}_{i\ell}}\over{\sum_m
\hat{p}_{rm} {\cal
I}_{im}}}.
$$
The latter is recognized as a step in the EM algorithm of Turnbull
(1976), and hence the iteration (\ref{EMSdc}) is seen to involve an
expectation, maximization \emph{and} smoothing step. That is, our
implementation of the local-EM algorithm (\ref{iter}) has explicitly resulted
 in an EMS algorithm.
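
For concreteness, the following Python fragment is a minimal numerical sketch of the iteration (\ref{EMSdc}). It assumes the partition $\cal J$ and the $n\times k$ indicator matrix of \S 2 have already been computed, uses a Gaussian kernel, and approximates the entries of ${\cal K}_h$ by numerical integration; it is intended only to make the expectation, maximization and smoothing steps explicit, and the function names are ours.
\begin{verbatim}
import numpy as np
from math import erf, sqrt

def kernel_int(a, b, t, h):
    """Integral of the Gaussian kernel K_h(u - t) over u in [a, b]."""
    s = h * sqrt(2.0)
    return 0.5 * (erf((b - t) / s) - erf((a - t) / s))

def smoothing_matrix(J, h, ngrid=50):
    """Entries K[l, j] = ||J_l||^{-1} int_{J_j} int_{J_l} K_h(u-t) du dt."""
    k = len(J)
    K = np.empty((k, k))
    for l, (al, bl) in enumerate(J):
        for j, (aj, bj) in enumerate(J):
            t = np.linspace(aj, bj, ngrid)
            vals = [kernel_int(al, bl, ti, h) for ti in t]
            K[l, j] = np.trapz(vals, t) / (bl - al)
    return K

def ems(I, J, h, niter=200):
    """Iterate p_{r+1} = M(p_r) K_h given the n-by-k indicator matrix I."""
    n, k = I.shape
    K = smoothing_matrix(J, h)
    p = np.full(k, 1.0 / k)
    for _ in range(niter):
        denom = I @ p                      # sum_m I_im p_m, one value per subject
        M = (I * p).T @ (1.0 / denom) / n  # Turnbull's E- and M-steps
        p = M @ K                          # smoothing step
    return p
\end{verbatim}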
%\newpage

\vspace{25pt}
\noindent {\bf General Remarks:}
\begin{itemize}
\item As $h\downarrow 0$ the integral $
\int_{J_l} K_h\left({u-t}\right)\, \mathrm{d}u$ becomes the indicator
function $1_{J_l}(t)$ for $J_l$ and ${\cal K}_{lj} \rightarrow
\delta_{lj}$. Consequently ${\cal K}_h$ converges to the identity
matrix, and the iteration (\ref{EMSdc}) becomes Turnbull's EM
algorithm in a result similar to that of Theorem 1 in Braun,
Duchesne and Stafford (2005). We also note that (\ref{EMSdc}) is not
the first instance of smoothing Turnbull's EM algorithm in the
literature. Pan and Chappell (1998) consider a special case of
directly smoothing Turnbull's EM algorithm, but their paper provides
no discussion of local likelihood, nor any insight into its relationship to the EMS algorithm.
%
\item When the data come in the form of a histogram, the set of observed
intervals and the partition $\cal J$ coincide. As a result,
(\ref{EMSdc}) only iterates once, and the resulting smoothed
histogram has been studied extensively (Jones 1989, Bellhouse and
Stafford 1999). Moreover, the weights used to smooth the histogram are the integrated kernel weights
advocated by a number of authors, including Gasser and
M\"{u}ller (1979) and Jones (1989).
\end{itemize}

%\newpage
\vspace{20pt} \noindent {\bf Remark on Convergence:}
\bigskip

\noindent Denote the EMS mapping as $\mathcal{M}_S (\mathbf{p}) = \mathcal{M}(\mathbf{p}) \mathcal{K}_h$. As a consequence of the Perron-Frobenius theorem, this mapping has a unique fixed point in the interior of the cone $\{{\bf p}\in\Re^k \mid {\bf p}\geq0\}$ (Latham 1994, 1995, 1996; Silverman \emph{et al.}\ 1990).  Empirical evidence suggests that the EMS iteration (\ref{EMSdc}) converges regardless of the initial estimate, with faster convergence for larger bandwidths. Although we cannot prove the convergence of the EMS algorithm in general, we attempt to derive an upper bound for the spectral radius of $\partial \mathcal{M}_S(\mathbf{p})/\partial \mathbf{p}$.  Let $\bar{\mathbf{p}}$ denote the fixed point of $\mathcal{M}_S (\mathbf{p})$ and let $|\!| \cdot |\!|$ denote the maximum column sum.  Then an upper bound for the spectral radius at $\bar{\mathbf{p}}$ is
$$
\left|\! \left| \frac{\partial \mathcal{M}_S(\bar{\mathbf{p}})}{\partial \mathbf{p}} \right|\!\right| = %|\!|\mathcal{K}_h|\!|
\left|\! \left| \frac{\partial \mathcal{M}(\bar{\mathbf{p}})}{\partial \mathbf{p}} \mathcal{K}_h \right|\!\right| \mbox{, where}
$$
$$
\left[\partial \mathcal{M}(\mathbf{p})/\partial \mathbf{p}\right]_{j\ell} =
\begin{cases}
 n^{-1} \sum_i \frac{\mathcal{I}_{ij}  \sum_{m \neq \ell} \mathcal{I}_{im} p_m}{\left(\sum_k \mathcal{I}_{ik} p_k\right)^2} & \mbox{when $j=\ell$,} \\
%& \\
 n^{-1} \sum_i \frac{-\mathcal{I}_{ij} \mathcal{I}_{i\ell} p_\ell}{\left(\sum_k \mathcal{I}_{ik} p_k\right)^2} & \mbox{otherwise.}
\end{cases}
$$
%Hence $\left[ \partial \mathcal{M}_S(\bar{\mathbf{p}})/\partial \mathbf{p} \right]_{j\ell} = \sum_{k} \left[\partial \mathcal{M}(\mathbf{p})/\partial \mathbf{p}\right]_{jk} \left[ \mathcal{K}_h\right]_{k\ell}$


%Denoting $C=\{{\bf p}\in\Re^k \mid {\bf p}\geq0\}$,
%However,  and the rate of convergence is given by the spectral radius of $\partial \mathcal{M}_S(\bar{\mathbf{p}})/\partial \mathbf{p}$. We consider the cases with and without right censoring separately.

%Let $\rho$ denote an arbitary but fixed matrix norm.
When there are no right censored observations ($R_i < \infty$ for all $i$), ${\cal K}_h$ is symmetric, positive, and irreducible with $\left[ \mathcal{K}_h\right]_{k\ell} < 1$ for all $k$ and $\ell$.  Moreover, the value of the entry $\left[ \mathcal{K}_h\right]_{k\ell}$ decreases with distance from the diagonal.  Following some algebraic manipulation,
\begin{align}
\left|\! \left| \dfrac{\partial \mathcal{M}_S(\bar{\mathbf{p}})}{\partial \mathbf{p}} \right|\!\right|
%&= \max_\ell \sum_j \left[ \partial \mathcal{M}_S(\bar{\mathbf{p}})/\partial \mathbf{p} \right]_{j\ell} \notag \\
% &= \max_{\ell} \sum_k [\mathcal{K}_h]_{k\ell} \left(\sum_j \left[ \partial \mathcal{M}_S(\bar{\mathbf{p}})/\partial \mathbf{p} \right]_{jk}\right) \notag \\
&= n^{-1} \max_{\ell} \sum_k [\mathcal{K}_h]_{k\ell} \sum_i \dfrac{\mathcal{I}_{ik}  \sum_{m \neq k} \mathcal{I}_{im} (\bar{p}_m - \bar{p}_k) }{\left(\sum_m \mathcal{I}_{im} \bar{p}_m\right)^2}. \label{e:upper_bound_spectral}
\end{align}
It is evident that this upper bound (\ref{e:upper_bound_spectral}) is a decreasing function of the bandwidth and an increasing function of the roughness of $\bar{\mathbf{p}}$.  Note that as the bandwidth $h$ increases, the $\bar{p}_j$'s become more uniform.  Therefore, the EMS mapping will be a global contraction when $h$ is large enough.
%Here, the matrix $\mathcal{M}(\bar{\mathbf{p}})/\partial \mathbf{p}$ can be written as $(\mathcal{B} + \mathcal{C})^{-1} \mathcal{C}$, where $\mathcal{B}$ and $\mathcal{C}$ are information matrices presented in Green (1990).


%We define $C=\{{\bf p}\in\Re^k \mid {\bf p}\geq0\}$ and denote the EMS mapping as $\mathcal{M}_S (\mathbf{p})$.  Then
% $$
% \mathcal{M}_S (\mathbf{p}): C \to C \mbox{ such that $\mathcal{M}_S (\mathbf{p}) = \mathcal{M}(\mathbf{p}) \mathcal{K}_h.$}
% $$
%If the EMS iteration (\ref{EMSdc}) converge to a fixed point, $\bar{\mathbf{p}} = \mathcal{M}_S (\bar{\mathbf{p}})$.  We might use Ostrowski's theorem to show the local convergence of the EMS algorithm.    then the rate of convergence is determined by the spectral radius of $\partial \mathcal{M}_S(\bar{\mathbf{p}})/\partial \mathbf{p}$. If  $|\!| \partial \mathcal{M}_S(\bar{\mathbf{p}})/\partial \mathbf{p} |\!| \leq |\!|\partial \mathcal{M}(\bar{\mathbf{p}})/\partial \mathbf{p}|\!|$, then the EMS is at least as fast as the EM.  We consider the cases with and without right censoring separately.
%
% If ${\cal K}_h$ is irreducible,  then the iteration (\ref{EMSdc}) will have a unique fixed
% point in the interior of $C$ (Latham 1994, 1995, 1996).

% When there are no right censored observations ($R_i < \infty$ for all $i$), ${\cal K}_h$ is positive and irreducible.  This implies that the iteration (\ref{EMSdc}) will have a unique fixed point in the interior of $C$ as a consequence of the Perron-Frobenious theorem (Latham 1994, 1995, 1996). Moreover, since $\sum_j {\cal K}_{j\ell}<1 $ for all $\ell$, $|\!|\mathcal{K}_h|\!|$ is less than one, and so is its spectral radius. Therefore, a sufficient condition for a convergent EMS algorithm is that%
% % Let $\mathcal{M}_S$ denote the EMS mapping that can be expressed as
% % $$
% % \mathcal{M}_S (\mathbf{p}) = \mathcal{M}(\mathbf{p}) \mathcal{K}_h.
% % $$
% % If the iterations (\ref{EMSdc}) converge to a fixed point, $\bar{\mathbf{p}}$, it can be shown that $|\!| \partial \mathcal{M}_S(\bar{\mathbf{p}})/\partial \mathbf{p} |\!| \leq |\!|\partial \mathcal{M}(\bar{\mathbf{p}})/\partial \mathbf{p}|\!|$ using the upper bound on the convergence rate of the EMS algorithm given in Nychka (1990).  More specifically,
% % It follows
% $$
% |\!| \partial \mathcal{M}_S(\bar{\mathbf{p}})/\partial \mathbf{p} |\!| \leq |\!|\mathcal{K}_h|\!|  |\!|\partial \mathcal{M}(\bar{\mathbf{p}})/\partial \mathbf{p}|\!| < 1.% \leq  |\!|\partial \mathcal{M}(\bar{\mathbf{p}})/\partial \mathbf{p}|\!|.
% $$
% Here, the matrix $\mathcal{M}(\bar{\mathbf{p}})/\partial \mathbf{p}$ can be written as $(\mathcal{B} + \mathcal{C})^{-1} \mathcal{C}$, where $\mathcal{B}$ and $\mathcal{C}$ are information matrices presented in Green (1990). Provided the  fact that $\mathcal{B}$ is positive definite and $\mathcal{C}$ is non-negative definite, $|\!|\partial \mathcal{M}(\bar{\mathbf{p}})/\partial \mathbf{p}|\!| < 1$, thus $|\!| \partial \mathcal{M}_S(\bar{\mathbf{p}})/\partial \mathbf{p} |\!| \leq |\!|\mathcal{K}_h|\!|$.
% % since $\bar{\mathbf{p}}$ may be far away from an EM solution, $\mathbf{p}^{\ast}$.
% % Nonetheless, when the banwidth $h$ is small, $\bar{\mathbf{p}}$ and $\mathbf{p}^{\ast}$ are close, and $\mathcal{M}(\bar{\mathbf{p}}) \approx \mathcal{M}(\mathbf{p}^{\ast})$ since $\mathcal{M}(\mathbf{p})$ is continuous.  Thus, if EM iterations converge, then the iteration (\ref{EMSdc}) can only improve the convergence of the EM algorithm as noted
% % in Silverman \emph{et al.}\ (1990).
% In addition, as $h$ increases,  $|\!|\mathcal{K}_h|\!|$, and hence the spectral radius, both decrease accelerating convergence. This pattern was noted in simulations conducted by Braun, Duchesne and Stafford (2005), but it is now transparent given the relationship between local-EM and EMS.



In the presence of right censored data, the partition $\cal J$ will
have one element $J_k=[\tau_{k-1},\tau_k]$ with
$\tau_k=\infty$. Suppose for the moment we let $\tau_k<\infty$, and we
consider the entries of ${\cal K}_h$ as $\tau_k\rightarrow\infty$. Note
that we have
\[
\begin{split}
\lim_{\tau_{k} \to\infty}\dfrac{\int_{J_l}
\int_{\tau_{k-1}}^{\tau_k} K_h(u-t)\,\mathrm{d}u \,
\mathrm{d}t}{|\!| J_{k} |\!|} &= \int_{J_l} \lim_{\tau_{k} \to
\infty} \dfrac{\int_{\tau_{k-1}}^{\tau_{k}} K_h(u-t) \,
\mathrm{d}u}{| \tau_{k} - \tau_{k-1} |} \, \mathrm{d}t =0, \quad
\mbox{and} \\
%
\lim_{\tau_{k} \to \infty}\dfrac{\int_{\tau_{k-1}}^{\tau_k}
\int_{\tau_{k-1}}^{\tau_k} K_h(u-t)\,\mathrm{d}u \,
\mathrm{d}t}{|\!| J_{k} |\!|} &=
%
\lim_{\tau_{k} \to \infty} \int_{\tau_{k-1}}^{\tau_{k}} \dfrac{ 1
- \int_{-\infty}^{\tau_{k-1}} K_h(u-t) \, \mathrm{d}u}{| \tau_{k} -
\tau_{k-1} |} \, \mathrm{d}t =1 %\\
\end{split}
\]
As a result, ${\cal K}_h$ is a reducible matrix with a normal form
\[
\mathcal{K}_h =
\begin{bmatrix}
\mathcal{K}_h'  & \mathbf{c} \\
\mathbf{0}  & 1
\end{bmatrix}.
\]
%and $J_k$ becomes an absorbing state. To facilitate this complication
However, since $\sum_j p_j =1$, it is sufficient to estimate $p_1, \ldots, p_{k-1}$.  As a result, we eliminate $J_k$ from the partition, iterate (\ref{EMSdc}) using the $(k-1) \times (k-1)$ matrix
$\mathcal{K}_h' $, and set $p_k=1-\sum_{j=1}^{k-1} p_j$ at each iteration. Since $\mathcal{K}_h'$ is symmetric, positive, and irreducible, the remark above applies.  Note that in this case if we consider
$h \downarrow 0$, the new iteration involving $\mathcal{K}_h' $ still
produces Turnbull's estimator.
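
A minimal sketch of this reduced iteration, under the same assumptions as the fragment in \S 3.1 and with our own function names, is as follows: the full Turnbull step is applied to all $k$ cells, only the first $k-1$ entries are smoothed by $\mathcal{K}_h'$, and $p_k$ is then restored by normalization.
\begin{verbatim}
import numpy as np

def ems_step_right_censored(p, I, K_prime):
    """One EMS step when J_k is the right-censored cell: smooth p_1..p_{k-1}
    with the reduced (k-1) x (k-1) matrix K', then set p_k = 1 - sum(others)."""
    n = I.shape[0]
    denom = I @ p                            # uses the full vector p, including p_k
    M = (I * p).T @ (1.0 / denom) / n        # Turnbull's step on all k cells
    head = M[:-1] @ K_prime                  # smooth the first k-1 entries only
    return np.append(head, 1.0 - head.sum())
\end{verbatim}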

\vspace{20pt}
%\newpage
\noindent {\bf Remark on Braun, Duchesne and Stafford (2005):}
\bigskip
%\newline

\noindent Although, thus far, the choice of the partition $\cal J$
has depended on the data in a natural way, it is, in fact, quite
arbitrary. For instance, we could consider a partition ${\cal J}_k$
based on an equally spaced set of mid-points ${\bf{\cal
T}}=\{t_1,\ldots,t_k\}$ over a finite interval $\cal S$ that
contains the support of the data. Here $\Delta=|\!|{\cal S}|\!|/k$
is the distance between two adjacent elements of $\cal T$. We may
demonstrate that Braun,
Duchesne and Stafford (2005) had originally implemented their
algorithm as an EMS algorithm although they were not aware of it. To see
this, we use their approximation
of $\int_{J_j} f(t)\, \mathrm{d}t$ as $f(t_j)\Delta$ throughout and
the iteration for $\hat{p}_j$ becomes
\begin{eqnarray*}
\hat{p}_{r+1, j}&=&\hat{f}_{r+1}(t_j)\Delta=\int_{J_j}
\hat{f}_{r+1}(t)\, \mathrm{d}t\\
&=& n^{-1} \sum_{il}{{\hat{p}_{rl}{\cal I}_{il} \int_{J_j}
\int_{J_l}K_h\left({u-t}\right) \, \mathrm{d}u \,
\mathrm{d}t}\over{\Delta \sum_m \hat{p}_{rm}
{\cal I}_{im}}}\\
&=&n^{-1} \sum_{il}{{\hat{f}_{r}(t_l)\Delta \,{\cal I}_{il}\, K_h\left({t_l-t_j}\right)
\Delta^2}\over{\Delta\sum_m \hat{f}_{r}(t_m)\Delta\,{\cal
I}_{im}}}\cdot
\end{eqnarray*}
Cancellation of the $\Delta$'s leaves us with expression (15) of
Braun, Duchesne and Stafford (2005). A similar result also holds in the
local linear case. In \S 6.2 we allow $k \rightarrow \infty$ and present uniform
convergence results.

\subsection{An Adaptive EMS algorithm}

In general when the approximating polynomial is truncated at the
$(p+1)^{st}$ term, the local-EM algorithm becomes
\begin{eqnarray*}
\hat{f}_{r+1}(t)=n^{-1}
\sum_{i=1}^n\mbox{E}_{\hat{f}_{r}}\left[\left. K_h(T_i -
t)\right|I_i\right]/\Psi_h[{\bf \hat{a}}_r(t)],
\end{eqnarray*}
where
\begin{eqnarray*}
\Psi_h(\bf a)&=&\int_\Re K_h(u-t)\exp\left\{ \sum_{j=1}^p a_j(u-t)^j
\right\} \, \mathrm{d}u
\end{eqnarray*}
and ${\bf \hat{a}}_r$ solves the moment matching equations
\begin{eqnarray*}
n^{-1}\sum_{i=1}^n E_{\hat{f}_r}[K_h(T_i-t)(T_i-t)^s |I_i]=\int_\Re
K_h(u-t)(u-t)^s\exp\left\{ \sum \hat{a}_j(u-t)^j \right\} \,
\mathrm{d}u
\end{eqnarray*}
for $s =1, 2, \ldots$. In this context we again consider computing
expectations using the conditional density $g_{r;I_i}$. Here
$E_{\hat{f}_r}[\,\cdot \,|I_i]$ is replaced with
$E_{\hat{g}_r}[\,\cdot |\,I_i]$ in the above equations, and we use
the notation ${\bf \hat{a}}_r \equiv {\bf a}(t; {\bf \hat{p}}_r)$ to
make explicit that the solution will now depend on the iterate
$\hat{\bf p}_r$. After some detail, we have the following iteration
for estimating the $p_j$'s:
\begin{eqnarray*}
\hat{p}_{r+1, j}&=&\int_{J_j} \hat{f}_{r+1}(t)\, \mathrm{d}t\\
&=& n^{-1} \int_{J_j}\left[\sum_{il}{{\hat{p}_{rl} {\cal
I}_{il}\int_{J_l}K_h(u-t) \, \mathrm{d}u }\over{||J_l||\sum_m
\hat{p}_{rm} {\cal I}_{im}}}\right] {\Bigg
/}\Psi_h(t; {\bf \hat{a}}_r) \, \mathrm{d}t\\
%
&=& n^{-1} \sum_{il}{{\hat{p}_{rl}{\cal I}_{il} \int_{J_j}
\int_{J_l}K_h\left({u-t}\right) /\Psi_h(t;{\bf \hat{a}}_r) \,
\mathrm{d}u \, \mathrm{d}t}\over{|\!|J_l|\!|\sum_m \hat{p}_{rm}
{\cal I}_{im}}}~\cdot
\end{eqnarray*}
This may again be written in terms of matrices as
\begin{equation}\label{EMSdho}
\hat{\bf p}_{r+1}={\cal M}(\hat{\bf p}_{r}) {\cal K}_h(\hat{\bf p}_{r}).
\end{equation}
Here $\cal M$ is defined as before, but the smoothing matrix now
has entries,
\[
{\cal K}_{lj}(\hat{\bf p}_{r})= |\!|J_l|\!|^{-1} \int_{J_j}
\dfrac{\int_{J_l} K_h\left({u-t}\right) \, \mathrm{d}u}{\Psi_h(t;
{\bf \hat{a}}_r)} \, \mathrm{d}t,
\] that depend on the previous iterate $\hat{\bf p}_r$. In other words, the iteration again reduces
to an EMS algorithm; however,  it becomes \emph{adaptive} in this case because
the smoothing matrix itself depends on $\hat{\bf p}_r$.

We note that the remarks in the previous section still apply. For
example, as $h\rightarrow 0$, $\Psi_h(t; {\bf \hat{a}}_r)=\int_\Re
K(v)\exp\{\sum_{j=1}^p \hat{a}_jh^jv^j\}\, \mathrm{d}v\rightarrow 1$, which means
the matrix ${\cal K}_h(\hat{\bf p}_{r})$ again converges to the
identity matrix, and the iteration again reduces to Turnbull's EM
algorithm. Furthermore, in the local linear case, we will have
$\Psi_h(t; {\bf \hat{a}}_r) \geq 1$ for all $t$.  Therefore,
$$
\int_{J_j} \dfrac{\int_{J_l} K_h\left({u-t}\right) \,
\mathrm{d}u}{\Psi_h(t; {\bf \hat{a}}_r)} \, \mathrm{d}t \leq %
\int_{J_j} \int_{J_l} K_h\left({u-t}\right) \, \mathrm{d}u \,
\mathrm{d}t,
$$ and the previous remarks on convergence will still hold.

\section{Local Likelihood Intensity Estimation}
In this section details are similar to the density estimation
context with the exception that the local likelihood differs
slightly and the data have a more complex structure. Here event
times for subjects follow a common nonhomogeneous Poisson process,
and interest centers on the unknown intensity $\lambda$.
This may be estimated through the use of the local likelihood
\begin{eqnarray}\label{llie}
{\cal L}(\lambda,t)=\sum_{ij}
Y_i(T_{ij})K_h(T_{ij}-t)\log\{\lambda(T_{ij})\}-
\int_0^{\infty}Y(u)K_h(u-t)\lambda(u)\, \mathrm{d}u,
\end{eqnarray}
and here Loader (1999) suggests a recipe similar to the case of
density estimation. This involves approximating the log intensity
with the polynomial $\cal P$ and, upon
substitution, maximizing the above with respect to the coefficients
$\bf a$. The intensity estimate at the point $t$ is then given by
$\hat{\lambda}(t)=\exp\{\hat{a}_0\}$.

Now since the $i$th subject is only observed at times ${\cal T}_i$, the $T_{ij}$ are interval censored and we can extend the strategy
of Braun, Duchesne and Stafford (2005). Here we replace the local
likelihood (\ref{llie}) with
\begin{equation} \label{llemie}
\begin{split}
{\cal L}({\bf a},t)&=\sum_i^n E_\lambda{\Bigg [}\sum_j^{n_i}
Y_i(T_{ij})K_h(T_{ij}-t){\cal P}(T_{ij} - t){\Bigg |}{\cal I}{\Bigg
]}\\ &\hspace{1.5in} - \int_0^{\infty}Y(u)K_h(u-t)\exp\{{\cal P}(u - t) \}\,
\mathrm{d}u
\end{split}
\end{equation}
and estimate $\lambda$ using a local-EM algorithm which cycles
through two steps at each iteration:
\begin{quote}
{\tt E-step}: compute the relevant expectations using the current
estimate of $\lambda$
\newline
{\tt M-step}: maximize ${\cal L}({\bf a},t)$ to get updated
estimates of $\{a_0, a_1, \ldots, a_p\}$ and $\lambda$.
\end{quote}
Here $\cal I$ represents the event that for all $T \in \cup_{ij}\{T_{ij}\}$ we have $T\in I_{i'j'}$ for some $i'$ \& $j'$.


\subsection{Local-EM and the EMS algorithm}
Since the event times $T_{ij}$ arise from a nonhomogeneous Poisson
process, the expectation in the first term of ${\cal L}({\bf a},t)$
simplifies to
$$
E_\lambda{\Bigg [}\sum_{j=1}^{n_i} Y_i(T_{ij})K_h(T_{ij}-t){\cal
P}(T_{ij} - t){\Bigg |}{\cal I}{\Bigg ]}=\sum_{j=1}^{k_i} n_{ij}
E_\lambda[Y_i(T)K_h(T-t){\cal P}(T - t)|T\in I_{ij}],
$$
where $T$ denotes a random variable over the panel $I_{ij}$.
Furthermore the conditional density of $T$ is determined by the intensity
$\lambda$ and has the following form:
\begin{eqnarray} \label{conden}
{{{\lambda}(u)}\over{\int_{I_{ij}}{{\lambda}(t)} \, \mathrm{d}t}}.
%\lambda(t){\Bigg /}\int_{I_{ij}}\lambda(u)\,\mathrm{d}u.
\end{eqnarray}

In the local constant case the local-EM algorithm is given by
\begin{eqnarray}\label{intiter}
\hat{\lambda}_{r+1}(t)=\sum_{ij}n_{ij}\mbox{E}_{\hat{\lambda}_{r}}\left[
\left.
Y_i(T)K_h\left({T-t}\right)\right|I_{ij} \right]{\Bigg
/}\int_0^{\infty}Y(u)K_h(u-t)\, \mathrm{d}u,
\end{eqnarray}
where expectation is computed with respect to the conditional
density (\ref{conden}) with $\lambda$ replaced by
$\hat{\lambda}_{r}$. Adopting the strategy of  \S 3 we
consider replacing $\hat{\lambda}_r$ with the piecewise constant
function $\hat{g}_r(t)=\hat{\Lambda}_{r\ell}/||J_\ell||$ for $t\in
J_\ell$, where $\hat{\Lambda}_{r\ell}=\int_{J_\ell}
\hat{\lambda}_r(u)\, \mathrm{d}u$. As a result, the conditional
density (\ref{conden}) is estimated by
\[
\hat{g}_{r; I_{ij}}(u)={{\hat{\Lambda}_{r\ell}{\cal
I}_{ij\ell}}\over{||J_\ell||\sum_m \hat{\Lambda}_{rm} {\cal I}_{ijm}}}
\text{ for $u\in J_\ell$,}
\]
and $\mbox{E}_{\hat{\lambda}_{r}}\left[\left.
Y_i(T)K_h\left({T-t}\right)\right|I_{ij}\right]$ is replaced in
(\ref{intiter}) by
\begin{eqnarray*}
\mbox{E}_{\hat{g}_{r}}\left[\left.
Y_i(T)K_h\left({T-t}\right)\right|I_{ij}\right]&=&\int_{I_{ij}}Y_i(u)K_h\left({u-t}\right)
\hat{g}_{r; I_{ij}}(u) \, \mathrm{d}u\\
%
&=&\sum_l{{\hat{\Lambda}_{rl}{\cal I}_{ijl}\int_{J_l}Y_i(u)
K_h\left(u-t\right)\,
\mathrm{d}u}\over{||J_l||\sum_m
\hat{\Lambda}_{rm} {\cal I}_{ijm}}}\\
%
&=&\sum_l{{Y_i(\tau_l)\hat{\Lambda}_{rl} {\cal
I}_{ijl}\int_{J_l}K_h\left({u-t}\right)\,
\mathrm{d}u}\over{||J_l||\sum_m \hat{\Lambda}_{rm} {\cal I}_{ijm}}}.
\end{eqnarray*}
%
This leads to a simple iteration on the vector $\hat{\boldsymbol{\Lambda}}_{r}$:
\begin{eqnarray*}
\hat{\Lambda}_{r+1, s}&=&\int_{J_s} \hat{\lambda}_{r+1}(t)\, \mathrm{d}t\\
%
&=&\int_{J_s}{\Bigg
\{}\sum_{ijl}n_{ij}{{Y_i(\tau_l)\hat{\Lambda}_{rl}{\cal
I}_{ijl}}\over{||J_l||\sum_m \hat{\Lambda}_{rm} {\cal
I}_{ijm}}}{{\int_{J_l}K_h\left({u-t}\right)\,
\mathrm{d}u}\over{\int_0^{\infty}Y(u)K_h(u-t)\, \mathrm{d}u}}{\Bigg
\}}\, \mathrm{d}t\\
%
&=&\sum_{ijl}n_{ij}{{Y_i(\tau_l)\hat{\Lambda}_{rl}{\cal
I}_{ijl}}\over{||J_l||\sum_m \hat{\Lambda}_{rm} {\cal
I}_{ijm}}}\int_{J_s}{{\int_{J_l}K_h\left({u-t}\right)\,
\mathrm{d}u}\over{\int_0^{\infty}Y(u)K_h(u-t)\, \mathrm{d}u}}\,
\mathrm{d}t\\
%
&=&\sum_{ijl}n_{ij}{{Y_i(\tau_l)\hat{\Lambda}_{rl} {\cal
I}_{ijl}}\over{||J_l||\sum_m \hat{\Lambda}_{rm} {\cal
I}_{ijm}}}{{Y(\tau_l)}\over{Y(\tau_l)}}\int_{J_s}{{\int_{J_l}K_h\left({u-t}\right)\,
\mathrm{d}u}\over{\int_0^{\infty}Y(u)K_h(u-t)\, \mathrm{d}u}}\,
\mathrm{d}t,
\end{eqnarray*}
which may be expressed in terms of matrices as
\begin{eqnarray}\label{EMSic}
\hat{\bf \Lambda}_{r+1}={\cal M}(\hat{\bf \Lambda}_{r}) {\cal K}_h,
\end{eqnarray}
where ${\cal K}_h$ is a $k \times k$ smoothing matrix with
entries
$${\cal K}_{ls}=
{{Y(\tau_l)}\over{||J_l||}}\int_{J_s}{{\int_{J_l}K_h\left({u-t}\right)\,
\mathrm{d}u}\over{\int_0^{\infty}Y(u)K_h(u-t)\, \mathrm{d}u}}\,
\mathrm{d}t,$$ and ${\cal M}(\hat{\bf \Lambda}_{r})$ is a
$k$-dimensional row vector whose $l$th entry is
$$
\sum_{ij}n_{ij}{{Y_i(\tau_l)\hat{\Lambda}_{rl} {\cal
I}_{ijl}}\over{Y(\tau_l)\sum_m \hat{\Lambda}_{rm} {\cal I}_{ijm}}}.
$$
The latter is recognized as a step in the EM algorithm of Hu,
Lagakos and Lockhart (2008) and hence the iteration (\ref{EMSic}) is
seen to involve an expectation, maximization $and$ smoothing step.
That is, our implementation of the local-EM algorithm
(\ref{intiter}) has resulted explicitly in an EMS algorithm.
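
A minimal sketch of the iteration (\ref{EMSic}) is given below for the special case of no dropout, so that $Y_i(t)\equiv 1$ over the study window, $Y(t)=n$, and the factors of $n$ cancel between ${\cal M}$ and ${\cal K}_h$. It assumes panel endpoints lie on the partition $\cal J$, uses a Gaussian kernel with numerical integration, and the names are ours; it is illustrative only.
\begin{verbatim}
import numpy as np
from math import erf, sqrt

def kernel_int(a, b, t, h):
    """Integral of the Gaussian kernel K_h(u - t) over u in [a, b]."""
    s = h * sqrt(2.0)
    return 0.5 * (erf((b - t) / s) - erf((a - t) / s))

def intensity_smoother(J, h, ngrid=50):
    """K[l, s] in the no-dropout case, where Y(t) = n on the study window."""
    lo, hi = J[0][0], J[-1][1]
    k = len(J)
    K = np.empty((k, k))
    for l, (al, bl) in enumerate(J):
        for s, (a, b) in enumerate(J):
            t = np.linspace(a, b, ngrid)
            num = np.array([kernel_int(al, bl, ti, h) for ti in t])
            den = np.array([kernel_int(lo, hi, ti, h) for ti in t])
            K[l, s] = np.trapz(num / den, t) / (bl - al)
    return K

def ems_intensity(panels, counts, J, h, niter=200):
    """panels[i][j]: 0/1 vector over cells covered by panel j of subject i;
    counts[i][j]: number of events in that panel.  No dropout is assumed."""
    k, n = len(J), len(panels)
    K = intensity_smoother(J, h)
    Lam = np.ones(k)
    for _ in range(niter):
        M = np.zeros(k)
        for i in range(n):
            for Iij, nij in zip(panels[i], counts[i]):
                M += nij * Lam * Iij / (n * np.dot(Lam, Iij))
        Lam = M @ K
    return Lam
\end{verbatim}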

%\newpage
\vspace{25pt}
\noindent {\bf General Remark:} \bigskip

\noindent As $h\downarrow 0$ the integral $
\int_{J_l}K_h\left({u-t}\right)\, \mathrm{d}u$ becomes the indicator
function $1_{J_l}(t)$ for $J_l$, $\int_0^{\tau_{k_i}} K_h(u-t)\,
\mathrm{d}u \rightarrow Y_i(t)$ and $\int_0^{\infty}Y(u)K_h(u-t)\,
\mathrm{d}u\rightarrow Y(t)$. As a result we have
$$\lim_{h\downarrow 0}{\cal K}_{ls}=\lim_{h\downarrow
0}{{Y(\tau_l)}\over{||J_l||}}\int_{J_s}{{\int_{J_l}K_h\left({u-t}\right)\,
\mathrm{d}u}\over{\int_0^{\infty}Y(u)K_h(u-t)\, \mathrm{d}u}}\,
\mathrm{d}t={{Y(\tau_l)}\over{||J_l||}}\int_{J_s}{{1_{J_l}(t)}\over{Y(t)}}\,
\mathrm{d}t=\delta_{ls}.$$ Consequently ${\cal K}_h$ converges to
the identity matrix and the iteration (\ref{EMSic}) becomes the EM
algorithm of Hu, Lagakos and Lockhart (2008).

\vspace{20pt} \noindent {\bf Remark on an Adaptive EMS Algorithm}
\bigskip

\noindent In general when the approximating polynomial is truncated
at the $(p+1)^{st}$ term, the local-EM algorithm becomes
\begin{eqnarray*}
\hat{\lambda}_{r+1}(t)=\sum_{ij}
n_{ij}\mbox{E}_{\hat{\lambda}_{r}}\left[\left.
Y_i(T)K_h\left({T-t}\right)\right|I_{ij}\right]/\Psi_h(t; {\bf \hat{a}}_r),
\end{eqnarray*}
where $\Psi_h$ is now
\begin{eqnarray*}
\Psi_h(t; \bf
a)&=&\int_0^{\infty}Y(u)K_h(u-t)\exp\left\{\sum_{j=1}^p a_j(u-t)^j
\right\}\, \mathrm{d}u,
\end{eqnarray*}
and ${\bf \hat{a}}_r \equiv \mathbf{a}(t;
\hat{\boldsymbol{\Lambda}}_r)$ solves the local likelihood equations
based on (\ref{llemie}). In this context we again compute
expectations using the conditional density $\hat{g}_{r; I_{ij}}$ and
details analogous to the case of density estimation yield the
following iteration for estimating the $\Lambda_j$'s
\begin{eqnarray*}
\hat{\Lambda}_{r+1,
s}&=&\sum_{ijl}n_{ij}{{Y_i(\tau_l)\hat{\Lambda}_{rl}{\cal
I}_{ijl}}\over{||J_l||\sum_m \hat{\Lambda}_{rm} {\cal
I}_{ijm}}}{{Y(\tau_l)}\over{Y(\tau_l)}}\int_{J_s}{{\int_{J_l}K_h\left({u-t}\right)
\, \mathrm{d}u}\over{\Psi_h [\mathbf{a}(t;
\hat{\boldsymbol{\Lambda}}_r)]}}\, \mathrm{d}t.
\end{eqnarray*}
This may again be written in terms of matrices
\begin{eqnarray}\label{EMSiho}
\hat{\bf \Lambda}_{r+1}={\cal M}(\hat{\bf \Lambda}_{r}) {\cal
K}_h(\hat{\bf \Lambda}_{r}).
\end{eqnarray}
Here $\cal M$ is defined as before; however, the smoothing matrix now
has entries
$${\cal K}_{ls}=
\frac{Y(\tau_l)}{|\!|J_l|\!|}\int_{J_s}{{\int_{J_l}K_h\left({u-t}\right)\,
\mathrm{d}u}\over{\Psi_h[\mathbf{a}(t;
\mathbf{\hat{\Lambda}}_r)]}}\, \mathrm{d}t$$ that depend on the
previous iterate $\hat{\bf \Lambda}_r$. In other words, the
resulting EMS algorithm is again adaptive because the smoothing
matrix itself depends on $\hat{\bf \Lambda}_r$. Furthermore, as
$h\rightarrow 0$, the fact that $\Psi_h(t; {\bf
\hat{a}}_r)\rightarrow Y(t)$ means that the matrix ${\cal K}_h(\hat{\bf
\Lambda}_{r})$ again converges to the identity matrix, and thus the
iteration again reduces to the EM algorithm of Hu, Lagakos and
Lockhart (2008).

\vspace{20pt} \noindent {\bf Remark on Convergence} \bigskip
%[\mathbf{\hat{a}}_r(t; \mathbf{\hat{\Lambda}}_r)]

In the case of intensity estimation, the second term of (\ref{llemie}) is integrated over
the interval in which $Y(u) >0$. This has implications for convergence as the conditions required to ensure the spectral radius of ${\cal K}_h$ is less than one are more complicated. Assuming a symmetric kernel with a
compact support on $[-1,1]$ and an interior point $t$, we may show
by the Taylor expansion that
\begin{equation} \label{e: phi_ineq}
\Psi_h(t; {\bf a}) \approxeq Y(t) + Y(t) \left(a_1^2 + 2a_2
\right)\sigma(K) h^2/2 + o(h^2),
\end{equation}
where $\sigma(K) = \int z^2 K(z) \,\mathrm{d}z$. %Now if the right
%hand side of the inequality (\ref{e: phi_ineq}) is greater than or
%equal to 1 for all $t\in \mathcal{S}$,
%then the spectral radius of ${\cal K}_h$ is less than one since
The upper bound on the spectral radius, and hence the rate of convergence, will
depend on the quantity
\begin{align*}
\int_{J_s} \frac{\int_{J_l}K_h\left({u-t}\right)\,
\mathrm{d}u}{\Psi_h(t; {\bf \hat{a}}_r)/ Y(\tau_l)}\, \mathrm{d}t.
\end{align*}
It is worthwhile to note that, by (\ref{e: phi_ineq}), the inequality
$\Psi_h(t; {\bf a}) \geq Y(t)$ almost certainly holds in the local constant and local linear cases
for some $h>0$ provided that $m > 1$.


\section{A Simulation Study}

In the case of density estimation, simulations given in Braun,
Duchesne and Stafford (2005) demonstrate that local-EM algorithms improve
the speed of convergence, particularly for large values of the window
size $h$. As noted in \S 3, this improvement becomes transparent
given we have identified a relationship between local-EM and EMS.
As a result, the simulations presented in this paper focus on the reduction
in mean integrated squared error (MISE) achieved when using
local-EM. Simulations were conducted for both
density and intensity estimation but we only report those for the
intensity case. Results for the density case were extremely similar.

In the case of intensity estimation, we simulate a nonhomogeneous
Poisson process with intensity $\lambda(t)$ equal to a re-scaled
gamma density function (shape = 9 and rate=3/4). A subject's event
times are simulated by thinning a unit-intensity Poisson process
where event times are either accepted or rejected with a probability
equal to $\lambda(t)$. Each subject is assumed to have a sequence of
predetermined observation times $t_1, t_2, \ldots, t_K$ where $t_i =
i$ and $K=20$. However, subjects miss a visit with increasing
probability; specifically, the probability of missing a visit equals
$(t_i/20)^{1/4} - 0.05$. Finally, a subject's panel counts are
obtained by aggregating event times between consecutive observed
visits. Note that each subject is assumed to have no event at time 0.
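
The data-generating mechanism just described is straightforward to reproduce; the following Python fragment is an illustrative sketch. Since the re-scaling of the gamma density is not specified above, the constant SCALE below is our own choice, made so that the thinning probability $\lambda(t)$ stays below one; the random seed is likewise arbitrary.
\begin{verbatim}
import numpy as np
from math import gamma as gamma_fn

rng = np.random.default_rng(0)
SHAPE, RATE, SCALE = 9.0, 0.75, 5.0    # SCALE is our own re-scaling choice

def lam(t):
    """Re-scaled gamma(shape = 9, rate = 3/4) density, kept below one."""
    t = np.asarray(t, dtype=float)
    dens = RATE**SHAPE * t**(SHAPE - 1) * np.exp(-RATE * t) / gamma_fn(SHAPE)
    return SCALE * dens

def simulate_subject(K=20):
    # events of a unit-intensity Poisson process on [0, K], thinned by lam(t)
    m = rng.poisson(K)
    cand = np.sort(rng.uniform(0.0, K, m))
    events = cand[rng.uniform(size=m) < lam(cand)]
    # visit i is missed with probability (t_i / 20)^{1/4} - 0.05
    visits = [0.0] + [float(i) for i in range(1, K + 1)
                      if rng.uniform() >= (i / 20.0) ** 0.25 - 0.05]
    if len(visits) < 2:                  # no visits observed after time 0
        return visits, np.array([], dtype=int)
    # panel counts: event times aggregated between consecutive observed visits
    counts = np.histogram(events, bins=visits)[0]
    return visits, counts
\end{verbatim}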

For each sample and for a fixed window $h$, we compute several intensity
estimates with a Gaussian kernel. The first, $\hat{\lambda}_{ke}$, assumes
no interval censoring has taken place and uses the event times
themselves, rather than the panel counts. Using the data-dependent partition, we then compute the
local-EM estimator in both constant and linear cases and several other
estimators that are all based on smoothing the self-consistent
estimator, $\hat{\Lambda}_{HLL}$, in Hu, Lagakos and Lockhart (2008)
after their algorithm has converged. The distinction between these
latter estimators depends on whether the weights associated with
$\hat{\Lambda}_{HLL}$ are placed at the left-end, the middle, or
the right-end point of each $J_j$. For each window size $h$, $M$
samples were generated, and these were used to approximate each estimator's MISE. We use the average ISE
defined as $M^{-1} \sum_{m=1}^M \int (\hat{\lambda}_m(u) - \lambda(u))^2 \, \mathrm{d}u$ as an estimate of the MISE.
This was performed for 40 different values of $h$ between $0.05$ and $3.95$ with $M=300$. The resulting MISE's for each estimator
are plotted in Figure 1.

The results favour the local-EM estimator considerably. Not only
does it track the ideal estimator $\hat{\lambda}_{ke}$ quite
closely, but it also attains the smallest MISE among all estimators
based on the censored data ($\hat{\lambda}_{ke}$ achieves a smaller
MISE, but only because it is based on the non-interval-censored
data). This is perhaps not all that surprising given that $\lambda(t)$ is
quite non-linear. In cases where $\lambda(t)$ is linear, the
improvements in MISE for the local-EM estimator are not as dramatic,
and care is required; otherwise the local-EM estimator can perform poorly.
\bigskip
\begin{center}
{\it Figure 1 here}
\end{center}
\bigskip

\section{The Role of local-EM}
Thus far we have exposed an interesting relationship between classes of
algorithms that demonstrates the EMS algorithm arises naturally from
local likelihood considerations. This occurs because we have chosen
to discretize the infinite dimensional parameter $f$ to enable
implementation of the local-EM algorithm. However, we could have
instead chosen to implement this algorithm through multiple
imputation, or by using MCEM, or through some other favourite
technique. So why EMS?

In this section we present results meant to strengthen the
suggestion that, for the context considered in this paper, local-EM
and penalized likelihood may be paired in a manner analogous to the
pairing of EM and likelihood. We focus primarily on the local
constant density estimation as the details here are easier to
follow. However, the details required for other cases, particularly
intensity estimation, are sketched in the remarks given.

In \S 6.1 we demonstrate that use of an equivalent kernel in a
local-EM algorithm leads to the modification necessary to maximize
a penalized likelihood (Nychka 1990). In \S 6.2 we present
uniform convergence results that suggest local-EM and EMS techniques
may be thought of synonymously. In \S 6.3 we consider the constraint defined by the penalty of \S
6.1 under the conditions given in \S 6.2. In particular, we speculate that
local-EM penalizes the usual nonparametric likelihood for departures
of the target function from a class $\cal C$ identified by the
eigenfunctions of the kernel.

\subsection{Local-EM and the Modified EMS algorithm}

Nychka (1990) identified a relationship between EMS and penalized
likelihood by demonstrating that a modified EMS algorithm was
equivalent to the one step late algorithm of Green (1990). In this
section, we demonstrate that with the appropriate choice of kernel,
an equivalent kernel, the local-EM algorithm may be used to
maximize a penalized likelihood. This occurs because the equivalent
kernel leads to Nychka's modification of the EMS algorithm.

We begin with the case of density estimation. Assume an equally
spaced partition ${\cal J}$, and consider a reparameterization of
Turnbull's likelihood as
\begin{equation*}
\begin{split}
\mathcal{L}(\boldsymbol{\theta}) &= \sum_i \log\left(\sum_k
\mathcal{I}_{i, k} \theta_k^2 \right) - n\sum_k \theta_k^2 - n,
\end{split}
\end{equation*}
where $\theta_j^2 =p_j$. The inclusion of a roughness penalty yields
\[
\mathcal{L}_{p}(\boldsymbol{\theta}) =
\mathcal{L}(\boldsymbol{\theta}) - \boldsymbol{\theta}^{T}
\mathbf{R} \boldsymbol{\theta}
\]
where $\mathbf{R} = n (\mathbf{S}^{-1} - \mathbf{I})$, and
$\mathbf{S}$ is a symmetric smoothing matrix. We explore the
relationship between this penalized likelihood and the local-EM
algorithm.

Consider the function
\begin{equation} \label{e:equivalent_kernel}
(1/f(u))^{1/2} K_h(u-t), \end{equation} where $f$ is the true, but
unknown, density, and $K_h(u-t)$ is any symmetric positive kernel.
Renormalization of (\ref{e:equivalent_kernel}) gives
\[
K_{h}^{\ast}(u - t) = (f(t)/f(u))^{1/2} K_h(u-t),
\]
where $\int_{\Re} K_{h}^{\ast}(u - t) \, \mathrm{d}u = 1+o(h)$. We
consider use of the kernel $K_h^{\ast}$ in our local-EM algorithm
where global use of the $\cal J$-approximant for $\hat{f}_r$ results
in $\mbox{E}_{\hat{f}_{r}}\left[\left.
K_h^{\ast}\left({T_i-t}\right)\right|I_i\right]$
being replaced in (\ref{iter}) with
\begin{eqnarray*}
\mbox{E}_{\hat{g}_{r}}\left[\left.
K_h^{\ast}\left({T_i-t}\right)\right|I_i\right]&=&
\int_{I_i} (\hat{p}_{r\ell}/\hat{p}_{rj})^{1/2} \,
K_h\left({u-t}\right)\hat{g}_{r;I_i}(u)\, \mathrm{d}u \quad \text{for
$t \in J_\ell$ and $u \in J_j$.}
\end{eqnarray*}
This in turn gives the following iteration for the
unknown probabilities $p_j$:
\begin{equation}\label{e:modified_EMS}
\hat{p}_{r+1, \ell}= n^{-1} \sum_k (\hat{p}_{r\ell}/\hat{p}_{rk})^{1/2} |\!|
J |\!|^{-1} \int_{J_\ell} \int_{J_k}K_h(u-t) \, \mathrm{d}u \,
\mathrm{d}t \sum_i \dfrac{ \mathcal{I}_{ik} \hat{p}_{rk}}{\sum_m
\mathcal{I}_{im} \hat{p}_{rm}},
\end{equation}
which can be expressed in the matrix form as
\begin{equation} \label{modified_EMSpl}
\hat{\bf p}_{r+1}={\cal M}(\hat{\bf p}_{r}) {\cal
K}_h^{\ast}(\hat{\bf p}_{r}), \quad
%
\mbox{where ${\cal K}_h^{\ast}(\hat{\bf p}_{r}) =
\widehat{\boldsymbol{\Theta}}_r \, \mathcal{K}_h \,
\widehat{\boldsymbol{\Theta}}_r^{-1}$.}
\end{equation}
Here $\widehat{\boldsymbol{\Theta}}_r = \text{diag}(\hat{p}_{rj}^{1/2})$ and the
iteration is recognized as Nychka's (1990) modified EMS algorithm
with ${\bf S}={\cal K}_h$.
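
In matrix terms the modification amounts to conjugating the smoothing matrix by $\widehat{\boldsymbol{\Theta}}_r$. The short Python sketch below illustrates one step of (\ref{modified_EMSpl}) using the indicator matrix of \S 2; it is illustrative only and the names are ours.
\begin{verbatim}
import numpy as np

def modified_ems_step(p, I, K):
    """One modified EMS step: Turnbull's E- and M-steps followed by
    smoothing with K* = Theta K Theta^{-1}, Theta = diag(p^{1/2})."""
    n = I.shape[0]
    denom = I @ p
    M = (I * p).T @ (1.0 / denom) / n
    sqrt_p = np.sqrt(p)
    K_star = (sqrt_p[:, None] * K) / sqrt_p[None, :]   # Theta K Theta^{-1}
    return M @ K_star
\end{verbatim}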

\vspace{20pt} \noindent {\bf Remark on Intensity Estimation}
\bigskip

\noindent In the case of intensity estimation, renormalization of the function $ (1/\lambda(u))^{1/2} K_h(u-t)$
gives
\[
K_{h}^{\ast}(u - t)=(\lambda(t)/\lambda(u))^{1/2} K_h(u-t).
\]
After some detail similar to the above, we can show that use of the
kernel $K_{h}^{\ast}(u - t)$ in the local-EM algorithm
(\ref{intiter}) leads to the following iteration:
\begin{equation*}
\hat{\boldsymbol{\Lambda}}_{r+1}={\cal
M}(\hat{\boldsymbol{\Lambda}}_{r}) {\cal
K}_h^{\ast}(\hat{\boldsymbol{\Lambda}}_{r}), \quad \mbox{where}
~{\cal K}_h^{\ast}(\hat{\bf \Lambda}_{r})
=\widehat{\boldsymbol{\Lambda}}_{r}^{1/2} \, \mathcal{K}_h
\,\widehat{\boldsymbol{\Lambda}}_{r}^{-1/2}.
\end{equation*}
Here the entry ${\cal K}_{sl}$ depends on $\widehat{\bf \Lambda}_r$
due to the denominator $\int_0^{\infty} Y(u) K_h^{\ast}(u-t) \,
\mathrm{d}u$. However, when the kernel has a compact
support, and $t$ is an interior point, we will
have
\[\int_0^{\infty} Y(u) K_h(u-t) \, \mathrm{d}u =
Y(t) \mbox{ for some small $h > 0$} \]
%
and Nychka (1990) showed the iteration will then maximize the above penalized
likelihood where ${\cal L}(\boldsymbol{\theta})$ is now
$$\mathcal{L}(\boldsymbol{\theta}) = \sum_{i, j} n_{ij}
\log\left( \sum_k \mathcal{I}_{ijk} |\!| J |\!| \theta_k^2 \right)
- \sum_k Y(\tau_k) |\!| J |\!| \theta_k^2$$
and $\theta_j^2=\Lambda_j$. Otherwise
${\cal K}_h$ will depend on $\hat{\bf \Lambda}_{r}$ and the next
remark applies.

\vspace{20pt} \noindent {\bf Remark on an Adaptive EMS Algorithm}
\bigskip

\noindent In cases where local-EM results in an adaptive EMS
algorithm the equivalent kernel remains the same. However, some
care is required and we only sketch details relying to some extent
on those given in Green (1990). In the adaptive case, the smoothing
matrix $S$ is now replaced by ${\cal K}_h(\boldsymbol{\theta})$
which depends on $\boldsymbol{\theta}$. This occurs because ${\cal
K}_h(\boldsymbol{\theta})$ involves the solution, $\hat{\bf
a}_{\boldsymbol{\theta}}$, to a set of local likelihood equations
that in turn involve $E_\theta$, or rather an E-step. In reference to Green
(1990) this implies that $J(\boldsymbol{\theta}')$ of expression (2) is then replaced by
$J(\boldsymbol{\theta}'|\boldsymbol{\theta})$, indicating that both
terms of (2) now involve an E-step. What we claim is that use of
the proposed equivalent kernel yields the one step late algorithm
for expression (2) of Green with $J(\boldsymbol{\theta}')$ replaced
with $J(\boldsymbol{\theta}'|\boldsymbol{\theta})$. What we do not
show is that this algorithm maximizes expression (1) of Green
(1990), but we suspect this is true.

\subsection{Uniform Convergence of EMS to local-EM}
In this section we consider the partition ${\cal J}_k$ introduced at
the end of \S 3.1 and demonstrate that, as $k\rightarrow \infty$, an
EMS iterate will converge uniformly to its local-EM counterpart. We
initially restrict attention to the case of density estimation where
we denote the EMS and local-EM iterates as $\hat{f}_{k\,r}$ and
$\hat{f}_{\infty \, r}$, respectively. In addition, we assume
$R_i<\infty$ for all $i$, or rather that $\cal S$ is a finite
interval and we assume $K(z)$ is a symmetric positive kernel with $\int K(z)
\, \mathrm{d}z =1$. Finally, we define a norm on $\mathcal{S}$ to be
$|\!|f|\!|_1 = \int_{\cal S} | f(u) |\, \mathrm{d}u$ and interpret
the uniform convergence of the function $f$ to the function $g$ to
mean that $|\!|f-g|\!|_1 \rightarrow 0$ as $k \to \infty$. This we
denote as $f \xrightarrow{{\cal L}^1} g$. These details permit the
statement of the following theorem:
%
\begin{theorem} \label{t:uniform_convergence} \end{theorem}
%\vspace{-5pt}
\begin{description}
\item{\bf I.} Define $\mathcal{F}_1 = \left\{ f \in \mathcal{L}^1 \mid
\text{$f$ is
nonnegative with $f(t) > 0$ for all $t \in \mathcal{S}$}\right\}$.
For a common initial value $\hat{f}_0 \in \mathcal{F}_1$, we have
for all $r = 1, 2, \ldots$:
\begin{description}
\item{A.} $\hat{f}_{k,r} \stackrel{\mathcal{L}^1}{\longrightarrow}
\hat{f}_{\infty,r}$;
\item{B.} $\hat{f}_{k,r}$, $\hat{f}_{\infty,r} \in \mathcal{F}_1$.
\end{description}
\label{e: ems_iterate_v1} %%
\item{\bf II.} When the equivalent kernel of \S 6.1 is used, we may
instead define
\[ \mathcal{F}_2 = \left\{ f \in \mathcal{L}^1 {\Big |} \text{$f$ is
nonnegative with $f(t) > 0$ for all $t \in \mathcal{S}$ and $\int
f^{1/2} < \infty$}\right\}.\] Now for a common initial value
$\hat{f}_0 \in \mathcal{F}_2$ with $|\!|\hat{f}_0|\!|_1=1$, the
above results, A and B, still hold for all $r$ where, of course,
${\cal F}_1$ is replaced with ${\cal F}_2$ in B. \label{e:
ems_iterate_v2}
\end{description}

\noindent {\bf Proof:} The proof is given in Appendix
\ref{app_uniform}. It relies on $\mathcal{H}_t(f) = \int_\Re
K_h(u-t) f(u)\, \mathrm{d}u$ being a bounded linear functional as
well as some other basic results in operator theory that may be
found in Royden (1988).

\vspace{20pt} \noindent {\bf Remark on Intensity Estimation}
\bigskip

\noindent %In the case of intensity estimation,
%we assume $\lambda$ to be in an appropriate class of functions so that, for example,
%$\lambda \in \mathcal{F}_1$.
Let ${\cal Y}_h(t) = \int_0^{\infty} Y(u) K_h(u-t) \,\mathrm{d}u$ and
define ${\cal H}_t(\lambda)$ to be
\[ \int \frac{K_h(u-t)}{{\cal Y}_h(t)} \lambda(u) \, \mathrm{d}u. \]
If $K_h$ is a symmetric positive kernel and ${\cal Y}_h(t) \geq 1$, then ${\cal H}_t$ is still a bounded
linear functional. As a result, demonstrating the uniform convergence
of an EMS iterate $\hat{\lambda}_{k\,r}$ to its
local-EM counterpart, $\hat{\lambda}_{\infty\,r}$, is virtually
identical to the proof of the first part of Theorem
\ref{t:uniform_convergence} provided that the initial value
$\hat{\lambda}_0 \in \mathcal{F}_1$.

Should $K_h$ be an equivalent kernel as defined in
\S 6.1, we are still able to show uniform
convergence by exploiting the properties of another functional defined as follows:
$$ \int\!\!\int \lambda^{1/2}(t) \frac{K_h(u-t)}{{\cal Y}_h(t)}
\lambda^{1/2}(u)\, \mathrm{d}u \, \mathrm{d}t.$$
This functional is again bounded as long as
${\cal Y}_h(t) \geq 1$ for some $h>0$. Therefore, for an initial
value $\hat{\lambda}_0 \in \mathcal{F}_2$, the proof is still very
much similar to the second part of Theorem
\ref{t:uniform_convergence}.


\subsection{Local-EM and Penalized Likelihood}

In this section we study the penalized likelihood of \S 6.1 under
the conditions of \S 6.2 where $k\rightarrow \infty$ or,
equivalently, $\Delta \downarrow 0$. We begin by considering those
values of $\boldsymbol{\theta}$ for which the penalty $\
\boldsymbol{\theta}^{T} \mathbf{R} \boldsymbol{\theta}$ is
minimized. For such $\boldsymbol{\theta}$ we have
\begin{eqnarray*}
\mathbf{R} \boldsymbol{\theta}&=&(\mathbf{\cal K}_h^{-1}
-\mathbf{I})\boldsymbol{\theta}=0
\end{eqnarray*}
or rather
\begin{eqnarray}\label{eigen}
\boldsymbol{\theta}&=&\mathbf{\cal K}_h\ \boldsymbol{\theta}.
\end{eqnarray}
This permits an interpretation of ${\cal L}_p(\boldsymbol{\theta})$
as penalizing the likelihood of Turnbull (1976) on the basis of the
proximity of $\boldsymbol{\theta}$ to the maximal eigenvectors of
the smoothing matrix $\mathbf{\cal K}_h$. To see this, let
$\rho_{(\ell)}$ denote the $\ell$th largest
eigenvalue of $\mathcal{K}_h$ with its corresponding eigenvector
$\boldsymbol{\gamma}_{(\ell)}$. Let
$\boldsymbol{\Gamma}=\begin{bmatrix} \boldsymbol{\gamma}_{(k)} &
\boldsymbol{\gamma}_{(k-1)}& \cdots & \boldsymbol{\gamma}_{(1)}
\end{bmatrix}$. Then the spectral decomposition of $\mathbf{R}$ is
$ \boldsymbol{\Gamma} \mathbf{D} \boldsymbol{\Gamma}^{T},$ where
$\mathbf{D}=\text{diag}\left( \rho_{(k)}^{-1}, \ \rho_{(k-1)}^{-1}, \
\ldots, \ \rho_{(1)}^{-1}\right) - \mathbf{I}$.
%
Since $1 \leq \rho_{(1)}^{-1} \leq \ldots \leq \rho_{(k)}^{-1}$,
$\mathbf{R}$ penalizes eigenvectors with small eigenvalues more heavily than
those with large ones.
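
\noindent To make the eigen-interpretation concrete, the following \texttt{Python}
sketch forms the discretized smoothing matrix from a Gaussian kernel on $k$ equal
bins (the domain, $k$ and $h$ are illustrative assumptions) and reports the penalty
attached to each of its leading eigenvectors; the matrix inverse is never formed
since, for an eigenvector $\boldsymbol{\gamma}_{(\ell)}$, the penalty is simply
$\rho_{(\ell)}^{-1}-1$.
\begin{verbatim}
import numpy as np

k, h = 40, 1.0
edges = np.linspace(0.0, 10.0, k + 1)
mid = 0.5 * (edges[:-1] + edges[1:])
Delta = edges[1] - edges[0]

# K_{jl} is approximately K_h(t_j - t_l) * Delta, as in the approximation below.
Kh = np.exp(-0.5 * ((mid[:, None] - mid[None, :]) / h) ** 2) \
     / (h * np.sqrt(2.0 * np.pi)) * Delta

rho, Gamma = np.linalg.eigh(Kh)            # Kh is symmetric; eigenvalues ascending
rho, Gamma = rho[::-1], Gamma[:, ::-1]     # reorder so rho[0] is the largest

# For the eigenvector gamma_(l), the penalty theta' R theta equals rho_(l)^{-1} - 1.
for l in range(5):
    print(l + 1, rho[l], 1.0 / rho[l] - 1.0)
# The leading eigenvector (eigenvalue near one) is barely penalized, and the
# penalty grows as the eigenvalue shrinks, so rough components are suppressed most.
\end{verbatim}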

We note that for $\boldsymbol{\theta}$ satisfying the condition
(\ref{eigen}) we also have
\begin{equation}
\boldsymbol{\theta}^T\boldsymbol{\theta}-\boldsymbol{\theta}^T{\cal K}_h\
\boldsymbol{\theta}=0 \label{eigen_2}
\end{equation}
and we may consider the limit of the left-hand side as $k
\rightarrow \infty$. Note that ${\cal K}_{jk}$ and $\theta_j$ may
be approximated as
\begin{eqnarray*}
{\cal K}_{jk}&=&\Delta^{-1} \int_{J_j}\int_{J_k}K_h(u-t)\, \mathrm{d}u\, \mathrm{d}t
\approx K_h(t_j-t_k) \, \Delta \\
\theta_j&=&\left\{\int_{J_j}f(u)\,
\mathrm{d}u\right\}^{1/2}\approx f^{1/2}(t_j) \, \Delta^{1/2}
\end{eqnarray*}
and so the left-hand side of (\ref{eigen_2}) becomes
$$
\sum_j f(t_j)\, \Delta-\sum_{jk}f^{1/2}(t_j)K_h(t_j-t_k)f^{1/2}(t_k)\,
\Delta^2.
$$
If we let $\Delta \downarrow 0$ the above expression becomes
$$
\int_{\cal T}f(u)\, \mathrm{d}u-\int_{\cal T}\int_{\cal
T}f^{1/2}(t)K_h(u-t)f^{1/2}(u)\, \mathrm{d}u\, \mathrm{d}t,
$$
which we note will equal 0 for any function $f$ belonging to the class
$$
{\cal C}=\left\{f \ \Bigg | \, f^{1/2}(t)=\int_{\cal
T}K_h(u-t)f^{1/2}(u)\, \mathrm{d}u \mbox{ for all $t \in {\cal T}$} \right\}.
$$
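To see this, note that for any $f \in {\cal C}$ the inner integral may be replaced
using the defining property of the class, so that
$$
\int_{\cal T}\int_{\cal T}f^{1/2}(t)K_h(u-t)f^{1/2}(u)\, \mathrm{d}u\, \mathrm{d}t
=\int_{\cal T}f^{1/2}(t)\left\{\int_{\cal
T}K_h(u-t)f^{1/2}(u)\, \mathrm{d}u\right\}\mathrm{d}t=\int_{\cal T}f(t)\, \mathrm{d}t
$$
and the difference above is exactly zero.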
Given this and the results of \S 6.2, we speculate that the
local-EM algorithm maximizes the likelihood
\begin{eqnarray*}
\mathcal{L}(f)=\sum_i \log\left(\int_{I_i}f(u)\,
\mathrm{d}u\right) - n\left( \int_{\Re} f(u)\, \mathrm{d}u - 1 \right)
\end{eqnarray*}
penalized according to the proximity of the function $f$ to the class of
maximal eigenfunctions $\cal C$. A similar interpretation holds in
the case of intensity estimation where $f$ is replaced with $\lambda$ in the class $\cal C$.
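
\noindent As a further informal check (again illustrative only, with an assumed
standard normal $f$ and Gaussian kernel), the limiting penalty
$\int f - \int\!\!\int f^{1/2}K_h\,f^{1/2}$ may be evaluated numerically; it is
nonnegative and shrinks as $h$ decreases, that is, as $f$ comes closer to
satisfying the eigenfunction condition defining $\cal C$.
\begin{verbatim}
import numpy as np

t = np.linspace(-8.0, 8.0, 1601)
dt = t[1] - t[0]
f = np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)    # standard normal density

def penalty(h):
    Kh = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2) \
         / (h * np.sqrt(2.0 * np.pi))
    smooth = Kh @ (np.sqrt(f) * dt)                 # int K_h(u-t) f^{1/2}(u) du
    return f.sum() * dt - (np.sqrt(f) * smooth).sum() * dt

for h in (2.0, 1.0, 0.5, 0.1):
    print(h, penalty(h))     # nonnegative, and decreasing as h decreases
\end{verbatim}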

\section{Concluding Remarks}
In this paper we have exposed an important relationship between
local likelihood and the EMS algorithm. This has had the advantage
of placing EMS in a context where it arises naturally as the
implementation of a local-EM algorithm; EMS need no longer be
viewed as {\it ad hoc}. It has also allowed a pairing of local-EM
and penalized likelihood in a manner that is analogous to the more
familiar EM/likelihood pairing. While the focus of this paper has
been on these relationships, we conclude with two remarks
concerning intensity estimation that demonstrate the developments
here are fairly general.

\vspace{20pt} \noindent {\bf The Inclusion of Covariates}

\noindent If subjects in a study differ in important ways and this
is captured by a set of covariates ${\bf x}_i$, then we may model
the individual intensities as
$$
\lambda_i(t)=\lambda(t)\exp\{ \boldsymbol{\beta}'{\bf x}_i\}
$$
where $\boldsymbol{\beta}$ is a vector of unknown regression
coefficients. Here we may estimate $\lambda$ and $\boldsymbol{\beta}$ iteratively,
where $\boldsymbol{\beta}$ would be estimated from the appropriate likelihood
with $\lambda$ held fixed at its current estimate $\hat{\lambda}$.
Computing $\hat{\lambda}$ at each iteration would itself involve a full
iteration (until convergence) of the algorithm (\ref{EMSic}) or
(\ref{EMSiho}) where we would now have
\begin{eqnarray*}
{\cal K}_{sl}&=&
\frac{Y(\tau_l)}{|\!|J_l|\!|}\int_{J_s}\frac{\int_{J_l}K_h\left({u-t}\right)\mathrm{d}u}{\sum_{i=1}^n
\exp\{\boldsymbol{\hat{\beta}}'{\bf x}_i\}\int_0^{\infty}Y_i(u)K_h(u-t)\,\mathrm{d}u}\,\mathrm{d}t\\
\Psi_h({\bf a})&=&\sum_{i=1}^n \exp\{\boldsymbol{\hat{\beta}}'{\bf
x}_i\}\int_0^{\infty}Y_i(u)K_h(u-t)\exp\left\{\sum_{j=1}^p
a_j(u-t)^j\right\}\mathrm{d}u
\end{eqnarray*}
respectively with $\boldsymbol{\hat{\beta}}$ set as the current
estimate of $\boldsymbol{\beta}$. The outer iteration between
$\boldsymbol{\beta}$ and $\lambda$ is similar to a strategy Betensky
\emph{et al.}\ (2002) use in the case of a failure time process.
Here local likelihoods are applied but there is no discussion of any
relationship to the EMS algorithm or penalized likelihood.
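
\noindent The following \texttt{Python} sketch is a schematic of this outer
iteration and not the algorithm above: the simulated panel counts, the Poisson
panel-count likelihood used for the $\boldsymbol{\beta}$-step, and the simple
smoothed occurrence/exposure update standing in for a full run of (\ref{EMSic})
are all illustrative assumptions.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n = 50
grid = np.linspace(0.0, 10.0, 201)
dt = grid[1] - grid[0]
x = rng.normal(size=n)                      # one covariate per subject
panels = [(0.0, 5.0), (5.0, 10.0)]          # common observation panels
beta_true = 0.5
lam_true = 0.2 + 0.1 * grid                 # a hypothetical baseline intensity

def panel_integral(lam, a, b):
    m = (grid >= a) & (grid < b)
    return lam[m].sum() * dt

counts = np.array([[rng.poisson(np.exp(beta_true * xi) * panel_integral(lam_true, a, b))
                    for (a, b) in panels] for xi in x])

beta = 0.0
lam = np.full(grid.size, counts.sum() / (n * 10.0))
h = 1.0
Kh = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / h) ** 2) / (h * np.sqrt(2 * np.pi))

for _ in range(20):
    # beta-step: one Newton step on a Poisson panel-count log-likelihood,
    # with lambda held fixed at its current estimate
    mu = np.array([[np.exp(beta * xi) * panel_integral(lam, a, b)
                    for (a, b) in panels] for xi in x])
    beta += np.sum((counts - mu) * x[:, None]) / np.sum(mu * x[:, None] ** 2)
    # lambda-step: a stand-in smoother -- spread the observed counts uniformly
    # over their panels, divide by covariate-adjusted exposure, kernel smooth
    raw = np.zeros(grid.size)
    for j, (a, b) in enumerate(panels):
        m = (grid >= a) & (grid < b)
        raw[m] = counts[:, j].sum() / (np.exp(beta * x).sum() * (b - a))
    lam = (Kh @ (raw * dt)) / (Kh.sum(axis=1) * dt)

print(beta)   # lands reasonably near beta_true for these simulated data
\end{verbatim}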

\vspace{20pt} \noindent {\bf The Spatial Context}

\noindent In the spatial context the local likelihood has a form
similar to (\ref{llie}) with the exception that various components
have different meanings. For example, panels for individual subjects
could now be regions in the map of a geographic area and the
subjects themselves could be different disease maps. It is the
spatial context that was originally treated by Silverman \emph{et
al.}\ (1990) as, for example, they considered use of the EM
algorithm for the reconstruction of a PET image. A full treatment of
the spatial context in an epidemiological setting may be found in
Brown, Fan and Stafford (2008). Methods there are contrasted with
Silverman \emph{et al.}\ (1990) to account for indirectly observed
data and the need to construct an intensity surface from
multiple maps rather than a single PET image. When there is only one
map the spatial equivalent of (\ref{EMSic}) iterates only once and
the methods reduce to those of Brillinger (1990, 1994).
\section*{Acknowledgements}

\begin{thebibliography}{99}
\bibitem{} Andersen, Per Kragh, Borgan, {\O}rnulf, Gill, Richard D., and
Keiding, Niels (1993). \emph{Statistical Models Based on Counting
Processes}. Springer-Verlag: New York.
%
\bibitem{} Bellhouse, D. R. and Stafford, J. E. (1999). Density
Estimation from Complex Surveys. \emph{Statistica Sinica} {\bf 9},
407--424.
%
\bibitem{} Betensky, Rebecca A., Lindsey, Jane C., Ryan, Louise M.
and Wand, M.
P. (1999). Local EM Estimation of the Hazard Function for
Interval-Censored Data. \emph{Biometrics} {\bf 55}, 238--245.
%
\bibitem{} Braun, John and Duchesne, Thierry and Stafford, James E.
(2005). Local likelihood density estimation for interval censored
data. \emph{The Canadian Journal of Statistics/La revue canadienne
de statistique} {\bf 33}, 39--60.
%
\bibitem{} Gasser, Th. and M\"{u}ller, H. G. (1979). Kernel
estimation of regression functions. \emph{Lecture notes in
mathematics: smoothing techniques for curve estimation}, 23--68.
%
\bibitem{} Green, Peter. (1990). On Use of the EM Algorithm for Penalized Likelihood Estimation. \emph{Journal of the Royal Statistical Society. Series
B (Methodological)}, 443--452.
%
\bibitem{} Hu, X., Lagakos, Stephen W., Lockhart, Richard A. (2008).
Generalized least squares estimation of the mean function of a
counting process based on panel counts. \emph{Statistica Sinica},
Preprint.
%
\bibitem{} Hjort, N. L. and Jones, M. C. (1996). Locally Parametric
Nonparametric Density Estimation.
\emph{The Annals of Statistics} {\bf 4}, 1619--1647.
%
\bibitem{} Latham, Geoff A. and Anderssen, Robert S. (1994). Assessing
Quantification for the EMS algorithm.
\emph{Linear Algebra and its Applications}, {\bf 210}, 89--122.
%
\bibitem{} Latham, Geoff A. (1995). Existence of EMS solutions and a priori
estimates. \emph{SIAM Journal on Matrix Analysis and Applications},
{\bf 16}, 943--953.
%
\bibitem{} Latham, Geoff A. (1996). Accelerating the EM algorithm by
smoothing: a special case.
\emph{Applied Mathematics Letters}, {\bf 4}, 47--53.
%
\bibitem{} Li, Linxiong and Watkins, Terry and Yu, Qiqing (1997).
An EM Algorithm for Smoothing the Self-consistent Estimator of
Survival Functions with Interval-censored Data. \emph{Scandinavian
Journal of Statistics} {\bf 24}, 531--542.
%
\bibitem{} Loader, Clive R. (1996). Local Likelihood Density
Estimation. \emph{The Annals of Statistics} {\bf 4}, 1602--1618.
%
\bibitem{} Nychka, Douglas (1990). Some properties of adding a smoothing
step to the EM algorithm. \emph{Statistics and Probability Letters}
{\bf 9}, 511--567.
%
\bibitem{} Pan, W. and Chappell, R. (1998). Estimating survival curves
with left truncated and interval censored
data via the EMS algorithm. {\em Communications in Statistics -
Theory and Methods} {\bf 4}, 777--793.
%
\bibitem{} Royden, H.L. (1988) \emph{Real Analysis}. Prentice-Hall.
%
\bibitem{} Silverman, B. W. and Jones, M. C. and Wilson, J. D. and
Nychka, D. W. (1990). A Smoothed $\mathrm{EM}$ Approach to Indirect Estimation
Problems, with Particular Reference to Stereology and Emission
Tomography (with discussion). \emph{Journal of the Royal Statistical Society. Series
B (Methodological)} {\bf 2}, 271--324.
%
%\bibitem{} Tsai, Wei-Yann and Crowley, John (1985). A Large Sample
% Study of Generalized Maximum Likelihood Estimators from Incomplete Data Via
%Self-Consistency. \emph{The Annals of Statistics} {\bf 4},
%1317--1334.
%
%
\bibitem{} Turnbull, Bruce W. (1976). The Empirical Distribution
Function with Arbitrarily Grouped, Censored and Truncated
Data. \emph{Journal of the Royal Statistical Society. Series B
(Methodological)} {\bf 3}, 290--295.
\end{thebibliography}

\appendix
\section{Theorem \ref{t:uniform_convergence}} \label{app_uniform}
Before proceeding to the proof of uniform convergence, we need some
basic results in operator theory (see Royden, 1988), stated as four
lemmas as follows:
\begin{lemma}\label{lemma:piece_approx}
Let $f \in \mathcal{L}^1$. Then the $\mathcal{J}_k$-approximant $g$
of $f$ converges uniformly to $f$ on $\mathcal{S}$ as $k \to
\infty$; that is, $g \stackrel{\mathcal{L}^1}{\longrightarrow} f$.
\end{lemma}

\noindent Let the square root of a function $f$ be denoted by
$f^{1/2}$, and let $K(\cdot)$ be a symmetric positive kernel with
$K_h(\cdot)=K(\cdot/h)/h$ for some $h>0$. Define two functionals
$\mathcal{H}_t$ and
$\mathcal{G}_t$ on $\Re$ to be:
\begin{enumerate}
\item $\mathcal{H}_t: \mathcal{L}^1 \mapsto \mathcal{L}^1$ such
that $\mathcal{H}_t(f) = \int K_h(u-t) f(u)\, \mathrm{d}u$, and
\item $\mathcal{G}_t: \mathcal{L}^1 \times \mathcal{L}^1 \mapsto
\mathcal{L}^1$ such
that $ \mathcal{G}_t(f, g) = \int g^{1/2}(t) K_h (u-t) f^{1/2}(u) \,
\mathrm{d}u$.
\end{enumerate}
Note that if an equivalent kernel is used, then
\[
\int K_h^{\ast}(u-t) f(u) \, \mathrm{d}u = \int f^{1/2}(t) K_h(u-t)
f^{1/2}(u) \, \mathrm{d}u = \mathcal{G}_t(f, f).
\]
%
\begin{lemma}\label{lemma:bl_functional}
$\mathcal{H}_t$ is a bounded linear functional on
$\mathcal{L}^1$.
That is, for all $t, a, b \in \Re$ and $f, g \in \mathcal{L}^1$, $\mathcal{H}_t(af+bg) =
a\mathcal{H}_t(f)+b\mathcal{H}_t(g)$, and there exists a real number $M_h$ such that
$|\mathcal{H}_t (f)| \leq M_h |\!| f |\!|_1$.
\end{lemma}

\begin{lemma} \label{lemma:integral_op1}
Let $f, g \in \mathcal{L}^1$ and $\gamma_h(u, t) = g(t) K_h(u-t) f(u)$. Then $\gamma_h(u, t)$ is an
$\mathcal{L}^1$ function on $\Re^2$ with
\[
\iint \left| \gamma_h(u, t) \right| \, \mathrm{d}u \, \mathrm{d}t \leq
M_h \cdot \left(\int |g| \right) \cdot \left(\int |f| \right) .
\]
\end{lemma}
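
\noindent A brief numerical spot-check of this bound (illustrative only, with
$M_h$ taken to be $\sup_z K_h(z)$ and assumed choices of $f$ and $g$) is:
\begin{verbatim}
import numpy as np

t = np.linspace(-10.0, 10.0, 801)
dt = t[1] - t[0]
h = 0.7
K_h = lambda z: np.exp(-0.5 * (z / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
M_h = 1.0 / (h * np.sqrt(2.0 * np.pi))      # sup of the Gaussian kernel K_h

f = np.exp(-np.abs(t))                      # illustrative L^1 functions
g = 1.0 / (1.0 + t ** 2)

# gamma_h(u, t) = g(t) K_h(u - t) f(u), tabulated with t in rows and u in columns
lhs = np.abs(g[:, None] * K_h(t[None, :] - t[:, None]) * f[None, :]).sum() * dt * dt
rhs = M_h * (np.abs(g).sum() * dt) * (np.abs(f).sum() * dt)
print(lhs <= rhs, lhs, rhs)                 # the double integral respects the bound
\end{verbatim}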
%
\begin{lemma} \label{lemma:integral_op2}
Let $g \in \mathcal{L}^{1/2}$ and $f \in \mathcal{L}^{1}$. Then
$\psi(x) = \int f(y-x) g(y) \, \mathrm{d}y$ is an $\mathcal{L}^{1/2}$
function that is defined almost everywhere with
\[
\left( \int |\psi|^{1/2} \right)^2 \leq \left(\int |f| \right) \cdot
\left( \int |g|^{1/2} \right)^2 .
\]
\end{lemma}

\vspace{20pt}
\noindent {\bf The Proof of Theorem \ref{t:uniform_convergence}}
\bigskip

%\begin{proof}
\noindent The bandwidth $h$ is considered fixed throughout the proof.
\begin{enumerate}
\item[I.]
\begin{enumerate}
\item[A.] Let $r=1$. Note that
$\int_{I_i} \hat{g}_{0}(u) \,\mathrm{d}u = \int_{I_i} \hat{f}_0(u)
\,\mathrm{d}u$ by the definition of $\hat{g}_{0}$. Repeated use of
the triangle inequality gives
\begin{equation*}
\begin{split}
& \left|\!\left| \hat{f}_{k,1} - \hat{f}_{\infty, 1}
\right|\!\right|_1 = n^{-1} \int_{\mathcal{S}}\left | \sum_i
\dfrac{\int_{I_i} K_h(u-t) (\hat{g}_{0}(u) - \hat{f}_0(u)) \,
\mathrm{d}u}{\int_{I_i}
\hat{f}_0(u) \,\mathrm{d}u}\right | \,\mathrm{d}t\\
%
& \hspace{.25in} \leq n^{-1} \sum_i \left( \int_{I_i}
\hat{f}_0(u)\,\mathrm{d}u \right)^{-1} \int_{\mathcal{S}} \int_{I_i}
K_h(u-t)\left| \hat{g}_{0}(u) - \hat{f}_0(u)\right|\,
\mathrm{d}u \, \mathrm{d}t \\
%line 2
&\hspace{.25in} \leq n^{-1} \sum_i \left( \int_{I_i}
\hat{f}_0(u)\,\mathrm{d}u \right)^{-1} \int_{\mathcal{S}}
\int_{\mathcal{S}} K_h(u-t) \left| \hat{g}_{0}(u) -
\hat{f}_0(u)\right| \,\mathrm{d}u \,\mathrm{d}t, \\
%line 3
& \hspace{.25in} \leq n^{-1} \sum_i \left( \int_{I_i}
\hat{f}_0(u)\,\mathrm{d}u \right)^{-1} M_h |\!|
\hat{g}_{0} - \hat{f}_0 |\!|_1 \int_{\mathcal{S}}\,\mathrm{d}t, \\
%line 4
& \hspace{.25in} = n^{-1} M_h |\!| \mathcal{S}|\!| \sum_i \left(
\int_{I_i} \hat{f}_0(u)\,\mathrm{d}u \right)^{-1} |\!| \hat{g}_{0} -
\hat{f}_0 |\!|_1.
\end{split}
\end{equation*}
Here the last inequality is due to Lemma \ref{lemma:bl_functional}.
Because $\hat{g}_{0}\xrightarrow{{\cal
L}^1} \hat{f}_0$ by Lemma \ref{lemma:piece_approx}, we have
$\hat{f}_{k, 1} \stackrel{\mathcal{L}^1}{\longrightarrow}
\hat{f}_{\infty, 1}$.

\vspace{20pt} \noindent {\it Induction Step:}
\newline
Let $b_{ri} = \int_{I_i} \hat{f}_{\infty,
r}(v)\,\mathrm{d}v/ \int_{I_i} \hat{g}_{r}(v)\,\mathrm{d}v$. Now
assuming that $\hat{f}_{k, r} \stackrel{\mathcal{L}^1}{\longrightarrow}
\hat{f}_{\infty, r}$, we have, with repeated use of the triangle
inequality,
\begin{align*}
%line 1
& \left|\!\left| \hat{f}_{k, r+1} - \hat{f}_{\infty, r+1}
\right|\!\right|_1 = n^{-1} \int_{\mathcal{S}}\left | \sum_i
\int_{I_i} K_h(u-t) \left(\dfrac{\hat{g}_{r}(u)}{\int_{I_i}
\hat{g}_{r}(v)\,\mathrm{d}v}-\dfrac{\hat{f}_{\infty,
r}(u)}{\int_{I_i}
\hat{f}_{\infty, r}(v) \, \mathrm{d}v}\right)\mathrm{d}u\right |
\,\mathrm{d}t\\
%line 2
&\hspace{.25in} \leq n^{-1} \sum_i \int_{\mathcal{S}}\int_{I_i}
K_h(u-t)\left | \left(\dfrac{\hat{g}_{r}(u)}{\int_{I_i}
\hat{g}_{r}(v)\,\mathrm{d}v}-\dfrac{\hat{f}_{\infty, r}(u)}{\int_{I_i}
\hat{f}_{\infty, r}(v)\,\mathrm{d}v}\right)\right |
\,\mathrm{d}u\,\mathrm{d}t\\
%line 3
&\hspace{.25in} \leq n^{-1} \sum_i \left( \int_{I_i}
\hat{f}_{\infty, r}(v)\,\mathrm{d}v\right)^{-1}
\int_{\mathcal{S}}\int_{\mathcal{S}} K_h(u-t) \left | b_{ri} \,
\hat{g}_{r}(u)-\hat{f}_{\infty, r}(u) \right
|\,\mathrm{d}u \,\mathrm{d}t \\
%line 4
&\hspace{.25in} \leq n^{-1} \sum_i \left( \int_{I_i}
\hat{f}_{\infty, r}(v)\,\mathrm{d}v\right)^{-1} M_h \left|\!\left|
b_{ri} \, \hat{g}_{r} -\hat{f}_{\infty, r}
\right|\!\right|_{1} \int_{\mathcal{S}} \,\mathrm{d}t \\
%line 5
&\hspace{.25in} = n^{-1} M_h |\!| \mathcal{S}|\!| \sum_i \left(
\int_{I_i} \hat{f}_{\infty, r}(v)\,\mathrm{d}v \right)^{-1}
\left|\!\left| b_{ri} \, \hat{g}_{r} -\hat{f}_{\infty, r}
\right|\!\right|_{1},
\end{align*}
where the last inequality is again due to Lemma
\ref{lemma:bl_functional}. Now since $\hat{g}_{r}\xrightarrow{{\cal
L}^1} \hat{f}_{\infty, r}$ and $b_{ri} \to 1$, which are the immediate
consequences of the induction assumption, we have, for all $i$,
\begin{equation*}
\Big|\!\Big|b_{ri} \, \hat{g}_{r} - \hat{f}_{\infty, r}
\Big|\!\Big|_1 \leq \Big|\!\Big| b_{ri} \, \hat{g}_{r} - \hat{g}_{r}
\Big|\!\Big|_1 + \Big|\!\Big| \hat{g}_{r} - \hat{f}_{\infty, r}
\Big|\!\Big|_1 \to 0.
\end{equation*}

Hence, by the induction principle, $\hat{f}_{k, r+1}
\stackrel{\mathcal{L}^1}{\longrightarrow} \hat{f}_{\infty, r+1}$ on
$\mathcal{S}$.
%
\item[B.] The induction argument can also be applied to show that
$\hat{f}_{k,r}$, $\hat{f}_{\infty,r} \in \mathcal{F}_1,~\forall r$.

Given a symmetric positive kernel $K_h$, an initial estimate
$\hat{f}_0 \in \mathcal{F}_1$ ensures that $\hat{f}_{k, 1}$ and $\hat{f}_{\infty, 1}$ also belong to
the class
$\mathcal{F}_1$. For example, for all $t \in \Re$,
\begin{align*}
\hat{f}_{k, 1}(t) &= n^{-1} \sum_i \dfrac{\sum_j \mathcal{I}_{i, j}
\hat{p}_{0, j} |\!|J_j|\!|^{-1} \int_{J_j} K_h(u-t)
\, \mathrm{d}u }{\sum_\ell \mathcal{I}_{i,
\ell} \hat{p}_{0, \ell}} > 0, \text{ and} \\
%
\hat{f}_{\infty, 1}(t) &= n^{-1} \sum_i \dfrac{\sum_j \mathcal{I}_{i, j}
\int_{J_j} K_h(u-t) \hat{f}_0(u)
\, \mathrm{d}u }{\sum_\ell \mathcal{I}_{i, \ell} \hat{p}_{0, \ell}} > 0.
Therefore, $\hat{f}_{k, 1}$, $\hat{f}_{\infty, 1} \in \mathcal{F}_1$. By
the same argument, the induction assumption, i.e.\ $\hat{f}_{k, r}$,
$\hat{f}_{\infty, r} \in \mathcal{F}_1$, immediately implies
$\hat{f}_{k, r+1}$, $\hat{f}_{\infty, r+1} \in \mathcal{F}_1$.
\end{enumerate}
%
%
% Equivalent Kernel
%
%
\item[II.]
\begin{enumerate}
\item[A.] Let $r=1$. By the triangle inequality and
Lemma \ref{lemma:integral_op1}, we have
%
\begin{align*}
& \left|\!\left| \hat{f}_{k, 1} - \hat{f}_{\infty, 1}
\right|\!\right|_1 = n^{-1} \int_{\mathcal{S}} \left| \sum_i
\dfrac{\int_{I_i} K_h(u-t) \left( \hat{g}_{0}^{1/2}(u) \,
\hat{g}_{0}^{1/2}(t) - \hat{f}_{0}^{1/2}(u) \, \hat{f}_{0}^{1/2}(t)
\right) \, \mathrm{d}u }{\int_{I_i} \hat{f}_{0}(v) \, \mathrm{d}v} \,
\right| \mathrm{d}t \\
% line2
&\hspace{0.25in} \leq n^{-1} \sum_i \int_{\mathcal{S}} \left|
\dfrac{\int_{I_i} K_h(u-t) \left( \hat{g}_{0}^{1/2}(u) \,
\hat{g}_{0}^{1/2}(t) - \hat{f}_{0}^{1/2}(u) \, \hat{f}_{0}^{1/2}(t)
\right) \, \mathrm{d}u }{\int_{I_i} \hat{f}_{0}(v) \, \mathrm{d}v} \,
\right| \mathrm{d}t \\
% line 3
&\hspace{0.25in} \leq n^{-1} \sum_i \left( \int_{I_i}
\hat{f}_{0}(v) \, \mathrm{d}v \right)^{-1}
\left\{ \int_{\mathcal{S}} \int_{\mathcal{S}} K_h(u-t)
\hat{g}_{0}^{1/2}(t) \left| \hat{g}_{0}^{1/2}(u) - \hat{f}_{0}^{1/2}(u)
\right| \, \mathrm{d}u \, \mathrm{d}t \right. \\
%
& \hspace{2in} + \left. \int_{\mathcal{S}} \int_{\mathcal{S}} K_h(u-t)
\hat{f}_{0}^{1/2}(u) \left| \hat{g}_{0}^{1/2}(t) - \hat{f}_{0}^{1/2}(t)
\right| \, \mathrm{d}u \, \mathrm{d}t \right\} \\
%%
&\hspace{0.25in} \leq n^{-1} \sum_i \left( \int_{I_i} \hat{f}_{0}(v) \,
\mathrm{d}v \right)^{-1} M_h \left( \left|\!\left| \hat{g}_{0}^{1/2}
\right|\!\right|_{1} \cdot \left|\!\left| \hat{g}_{0}^{1/2} -
\hat{f}_{0}^{1/2} \right|\!\right|_{1} + \left|\!\left|
\hat{f}_{0}^{1/2} \right|\!\right|_{1} \cdot \left|\!\left|
\hat{g}_{0}^{1/2} - \hat{f}_{0}^{1/2} \right|\!\right|_{1} \right)
\end{align*}
%
Because $\hat{g}_{0}
\stackrel{\mathcal{L}^1}{\longrightarrow} \hat{f}_0$
by Lemma \ref{lemma:piece_approx}, we have, by the
continuous mapping theorem (\textsc{cmt}), $\left|\!\left|
\hat{g}_{0}^{1/2} - \hat{f}_{0}^{1/2} \right|\!\right|_{1}
\to 0$ as $\Delta \to 0$. Therefore, $\hat{f}_{k,
1} \stackrel{\mathcal{L}^1}{\longrightarrow} \hat{f}_{\infty, 1}$.

\vspace{20pt}\noindent{\it Induction Step:}
Let $c_{ri} = \left( \int_{I_i} \hat{g}_{r}(t) \, \mathrm{d}t
\right)^{-1} \left( \int_{I_i} \hat{f}_{\infty, r}(t) \,
\mathrm{d}t \right)$. Now assume that $\hat{f}_{k, r}
\stackrel{\mathcal{L}^1}{\longrightarrow}
\hat{f}_{\infty, r}$ and $\hat{f}_{k, r}$, $\hat{f}_{\infty, r} \in
\mathcal{F}_2$, which will be shown in part B. This assumption
immediately implies $c_{r i} \to 1$ for all $i$ and $ \hat{g}_{r}^{1/2}
\stackrel{\mathcal{L}^1}{\longrightarrow} \hat{f}_{\infty,
r}^{1/2}$ on $\mathcal{S}$ by the \textsc{cmt}. It follows, by Lemmas
\ref{lemma:piece_approx} and \ref{lemma:integral_op1}, that
%
\begin{align*}
& \left|\!\left| \hat{f}_{k, r+1} - \hat{f}_{\infty, r+1}
\right|\!\right|_1 \\
% line 2
&\hspace{.25in} \leq n^{-1} \sum_i \left( \int_{I_i}
\hat{f}_{\infty, r}(v) \, \mathrm{d}v \right)^{-1}
\int_{\mathcal{S}} \int_{I_i} K_h(u-t) \left| c_{ri}
\hat{g}_{r}^{1/2}(u) \, \hat{g}_{r}^{1/2}(t) -
\hat{f}_{\infty, r}^{1/2}(u) \, \hat{f}_{\infty, r}^{1/2}(t) \right| \,
\mathrm{d}u \, \mathrm{d}t \\
% line 3
&\hspace{.25in} \leq n^{-1} \sum_i \left(\int_{I_i} \hat{f}_{\infty, r}(v)
\, \mathrm{d}v \right)^{-1}
\int_{\mathcal{S}} \int_{\mathcal{S}} K_h(u-t) \left\{ c_{ri}
\hat{g}_{r}^{1/2}(t) \left| \hat{g}_{r}^{1/2}(u) - \hat{f}_{\infty,
r}^{1/2}(u) \right| \right. \\
% line 4
& \hspace{2in} + \left. \hat{f}_{\infty, r}^{1/2}(u)
\hat{g}_{r}^{1/2}(t) \left| c_{ri} - 1 \right| + \hat{f}_{\infty,
r}^{1/2}(u) \left| \hat{g}_{r}^{1/2}(t) - \hat{f}_{\infty, r}^{1/2}(t) \right|
\right\} \, \mathrm{d}u \, \mathrm{d}t \\
% line 5
&\hspace{.25in} \leq n^{-1} \sum_i \left( \int_{I_i}
\hat{f}_{\infty, r}(v) \, \mathrm{d}v \right)^{-1}
M_h \left( |c_{ri}| \left|\!\left| \hat{g}_{r}^{1/2}
\right|\!\right|_{1} \cdot \left|\!\left| \hat{g}_{r}^{1/2} -
\hat{f}_{\infty, r}^{1/2} \right|\!\right|_{1} +|c_{ri} - 1|
\left|\!\left| \hat{g}_{r}^{1/2} \right|\!\right|_{1} \cdot
\left|\!\left| \hat{f}_{\infty, r}^{1/2} \right|\!\right|_{1} \right. \\
% line 6
& \hspace{2in} +\left. \left|\!\left| \hat{f}_{\infty, r}^{1/2}
\right|\!\right|_{1} \cdot \left|\!\left| \hat{g}_{r}^{1/2} -
\hat{f}_{\infty, r}^{1/2} \right|\!\right|_{1} \right) \to 0.
\end{align*}
%
By induction, $\hat{f}_{k, r+1}
\stackrel{\mathcal{L}^1}{\longrightarrow} \hat{f}_{\infty, r+1}$.
%%%%%%%%%%%%%%%%
\item[B.] Let $U_0$ denote an upper bound for $\hat{f}_{0}$, and hence
for $\hat{g}_0$. Then it follows that
\begin{align*}
\hat{f}_{\infty, 1}(t) \leq U_0 \left(\min_i \int_{I_i}
\hat{f}_{0}(v) \, \mathrm{d}v\right)^{-1} \text{ and }
%
\hat{f}_{k, 1}(t) \leq U_0 \left(\min_i \int_{I_i}
\hat{f}_{0}(v) \, \mathrm{d}v\right)^{-1}.
\end{align*}

Next, define two functions, $\hat{\zeta}_{k, 0}$ and
$\hat{\zeta}_{\infty, 0}$, to be
\[
\hat{\zeta}_{k, 0}(t) = \int_{\Re} K_h(u-t) \hat{g}_{0}^{1/2}(u) \,
\mathrm{d}u \text{ and }
\hat{\zeta}_{\infty, 0}(t) = \int_{\Re} K_h(u-t) \hat{f}_{0}^{1/2}(u) \,
\mathrm{d}u.
\]

Since $\int |\hat{f}_0|^{1/2} < \infty$,
$\hat{\zeta}_{k, 0}$ and $\hat{\zeta}_{\infty, 0}$ belong to
$\mathcal{F}_2$ by Lemma \ref{lemma:integral_op2}, i.e.
$$\int \hat{\zeta}_{k, 0}^{1/2} \leq M_h^{1/2} \int \hat{f}_0^{1/2} <
\infty \text{ and } \int \hat{\zeta}_{\infty, 0}^{1/2} \leq M_h^{1/2} \int
\hat{f}_0^{1/2} < \infty. $$ This in turn implies that $\hat{f}_{k,
1} \in \mathcal{F}_2$ because $\hat{f}_{k, 1}(t) >0$ for all $t \in \Re$ and
\begin{align*}
\int \hat{f}_{k, 1}^{1/2} & \leq
\int \left( \left( \min_i \int_{I_i} \hat{g}_{0}(y) \, \mathrm{d}y
\right)^{-1}
U_{0}^{1/2} \hat{\zeta}_{k, 0}(t) \right)^{1/2} \mathrm{d}t \\
& = U_0^{1/4} \left( \min_i \int_{I_i}
\hat{g}_{0}(y) \, \mathrm{d}y \right)^{-1/2}
\int \hat{\zeta}_{k, 0}^{1/2} < \infty.
\end{align*}
By the same argument, $\hat{f}_{\infty, 1}
\in \mathcal{F}_2$.

\vspace{20pt}
Next, we assume that $\hat{f}_{k, r} \in \mathcal{F}_2$,
$\hat{f}_{\infty, r} \in \mathcal{F}_2$.
Two immediate consequences of this induction assumption are that there
exists a common upper bound $U_r < \infty$ for $\hat{f}_{k, r} $ and
$\hat{f}_{\infty, r} $, and that $\hat{\zeta}_{k, r}$,
$\hat{\zeta}_{\infty, r} \in \mathcal{L}^{1/2}$ according to Lemma
\ref{lemma:integral_op2}. In turn, it can be shown that
$\hat{f}_{k, r+1}$, $\hat{f}_{\infty, r+1} >0$ for all $t \in \Re$ and
%
\begin{align*}
\int \hat{f}_{k, r+1}^{1/2} & \leq
\int \left( \left( \min_i \int_{I_i} \hat{g}_{r}(y) \, \mathrm{d}y
\right)^{-1}
U_{r}^{1/2} \hat{\zeta}_{k, r}(t) \right)^{1/2} \mathrm{d}t \\
& = U_r^{1/4} \left( \min_i \int_{I_i}
\hat{g}_{r}(y) \, \mathrm{d}y \right)^{-1/2}
\int \hat{\zeta}_{k, r}^{1/2} < \infty \end{align*}

and therefore, $\hat{f}_{k, r+1}$, $\hat{f}_{\infty, r+1} \in \mathcal{F}_2 $.
\end{enumerate}
\end{enumerate}
%\end{proof}
%\newpage
%
\begin{figure}
\centering
%\includegraphics[scale=0.6]{IMSEtrue.pdf}
\caption{The proposed local-EM/EMS intensity estimate achieves the lowest
overall MISE with a small bandwidth of 0.85, compared with the
estimate of Hu, Lagakos, and Lockhart (2008), which is smoothed by
placing expected increments at the left endpoints, midpoints and right
endpoints of the intervals $\mathcal{J}$. Moreover, the bandwidth at which the
local-EM/EMS estimate attains its lowest MISE is close to that of the kernel
intensity estimate that would be used if the data were not interval censored.}
\label{f: sim_mise}
\end{figure}
\end{document}
